Re: Auto-shutdown for EC2 clusters

2008-11-26 Thread Tom White
I've just created a basic script to do something similar for running a
benchmark on EC2. See
https://issues.apache.org/jira/browse/HADOOP-4382. As it stands, the
code for detecting when Hadoop is ready to accept jobs is simplistic,
to say the least, so any ideas for improvement would be great.
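
One possible shape for such a readiness check -- a rough sketch, not the
HADOOP-4382 script itself -- is to poll the Hadoop CLI from the master until
both daemons answer. This assumes the hadoop command is on the PATH and
exits nonzero while the services are still coming up:

    import subprocess
    import time

    def hadoop_ready(retries=30, delay=20):
        """Poll until both the NameNode and the JobTracker answer their CLIs."""
        checks = [
            ["hadoop", "dfsadmin", "-report"],  # fails until the NameNode responds
            ["hadoop", "job", "-list"],         # fails until the JobTracker responds
        ]
        for _ in range(retries):
            if all(subprocess.call(cmd, stdout=subprocess.DEVNULL,
                                   stderr=subprocess.DEVNULL) == 0
                   for cmd in checks):
                return True
            time.sleep(delay)
        return False

A stricter check would also parse the dfsadmin report for the number of
live datanodes before declaring the cluster ready.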

Thanks,
Tom

On Fri, Oct 24, 2008 at 11:53 PM, Chris K Wensel <[EMAIL PROTECTED]> wrote:
>
> fyi, the src/contrib/ec2 scripts do just what Paco suggests.
>
> minus the static IP stuff (you can use the scripts to login via cluster
> name, and spawn a tunnel for browsing nodes)
>
> that is, you can spawn any number of uniquely named, configured, and sized
> clusters, and you can increase their size independently as well. (shrinking
> is another matter altogether)
>
> ckw
>
> On Oct 24, 2008, at 1:58 PM, Paco NATHAN wrote:
>
>> Hi Karl,
>>
>> Rather than using separate key pairs, you can use EC2 security groups
>> to keep track of different clusters.
>>
>> Effectively, that requires a new security group for every cluster --
>> so just allocate a bunch of different ones in a config file, then have
>> the launch scripts draw from those. We also use EC2 static IP
>> addresses and then have a DNS entry named similarly to each security
>> group, associated with a static IP once that cluster is launched.
>> It's relatively simple to query the running instances and collect them
>> according to security groups.
>>
>> One way to handle detecting failures is just to attempt SSH in a loop.
>> Our rough estimate is that approximately 2% of the attempted EC2 nodes
>> fail at launch. So we allocate more than enough, given that rate.
>>
>> In a nutshell, that's one approach for managing a Hadoop cluster
>> remotely on EC2.
>>
>> Best,
>> Paco
>>
>>
>> On Fri, Oct 24, 2008 at 2:07 PM, Karl Anderson <[EMAIL PROTECTED]> wrote:
>>>
>>> On 23-Oct-08, at 10:01 AM, Paco NATHAN wrote:

>>>> This workflow could be initiated from a crontab -- totally automated.
>>>> However, we still see occasional failures of the cluster, and must
>>>> restart manually, but not often.  Stability for that has improved much
>>>> since the 0.18 release.  For us, it's getting closer to total
>>>> automation.
>>>>
>>>> FWIW, that's running on EC2 m1.xl instances.
>>>
>>> Same here.  I've always had the namenode and web interface be accessible,
>>> but sometimes I don't get the slave nodes - usually zero slaves when this
>>> happens, sometimes I only miss one or two.  My rough estimate is that this
>>> happens 1% of the time.
>>>
>>> I currently have to notice this and restart manually.  Do you have a good
>>> way to detect it?  I have several Hadoop clusters running at once with the
>>> same AWS image and SSH keypair, so I can't count running instances.  I could
>>> have a separate keypair per cluster and count instances with that keypair,
>>> but I'd like to be able to start clusters opportunistically, with more than
>>> one cluster doing the same kind of job on different data.
>>>
>>>
>>> Karl Anderson
>>> [EMAIL PROTECTED]
>>> http://monkey.org/~kra
>>>
>>>
>>>
>>>
>
> --
> Chris K Wensel
> [EMAIL PROTECTED]
> http://chris.wensel.net/
> http://www.cascading.org/
>
>


Re: Auto-shutdown for EC2 clusters

2008-10-24 Thread Chris K Wensel


fyi, the src/contrib/ec2 scripts do just what Paco suggests.

minus the static IP stuff (you can use the scripts to login via cluster
name, and spawn a tunnel for browsing nodes)

that is, you can spawn any number of uniquely named, configured, and sized
clusters, and you can increase their size independently as well.
(shrinking is another matter altogether)

ckw

On Oct 24, 2008, at 1:58 PM, Paco NATHAN wrote:

> Hi Karl,
>
> Rather than using separate key pairs, you can use EC2 security groups
> to keep track of different clusters.
>
> Effectively, that requires a new security group for every cluster --
> so just allocate a bunch of different ones in a config file, then have
> the launch scripts draw from those. We also use EC2 static IP
> addresses and then have a DNS entry named similarly to each security
> group, associated with a static IP once that cluster is launched.
> It's relatively simple to query the running instances and collect them
> according to security groups.
>
> One way to handle detecting failures is just to attempt SSH in a loop.
> Our rough estimate is that approximately 2% of the attempted EC2 nodes
> fail at launch. So we allocate more than enough, given that rate.
>
> In a nutshell, that's one approach for managing a Hadoop cluster
> remotely on EC2.
>
> Best,
> Paco
>
>
> On Fri, Oct 24, 2008 at 2:07 PM, Karl Anderson <[EMAIL PROTECTED]> wrote:
>>
>> On 23-Oct-08, at 10:01 AM, Paco NATHAN wrote:
>>>
>>> This workflow could be initiated from a crontab -- totally automated.
>>> However, we still see occasional failures of the cluster, and must
>>> restart manually, but not often.  Stability for that has improved much
>>> since the 0.18 release.  For us, it's getting closer to total
>>> automation.
>>>
>>> FWIW, that's running on EC2 m1.xl instances.
>>
>> Same here.  I've always had the namenode and web interface be accessible,
>> but sometimes I don't get the slave nodes - usually zero slaves when this
>> happens, sometimes I only miss one or two.  My rough estimate is that this
>> happens 1% of the time.
>>
>> I currently have to notice this and restart manually.  Do you have a good
>> way to detect it?  I have several Hadoop clusters running at once with the
>> same AWS image and SSH keypair, so I can't count running instances.  I could
>> have a separate keypair per cluster and count instances with that keypair,
>> but I'd like to be able to start clusters opportunistically, with more than
>> one cluster doing the same kind of job on different data.
>>
>>
>> Karl Anderson
>> [EMAIL PROTECTED]
>> http://monkey.org/~kra






--
Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/



Re: Auto-shutdown for EC2 clusters

2008-10-24 Thread Steve Loughran

Paco NATHAN wrote:

> Hi Karl,
>
> Rather than using separate key pairs, you can use EC2 security groups
> to keep track of different clusters.
>
> Effectively, that requires a new security group for every cluster --
> so just allocate a bunch of different ones in a config file, then have
> the launch scripts draw from those. We also use EC2 static IP
> addresses and then have a DNS entry named similarly to each security
> group, associated with a static IP once that cluster is launched.
> It's relatively simple to query the running instances and collect them
> according to security groups.
>
> One way to handle detecting failures is just to attempt SSH in a loop.
> Our rough estimate is that approximately 2% of the attempted EC2 nodes
> fail at launch. So we allocate more than enough, given that rate.



We have a patch to add a ping() operation to all the services -- namenode,
datanode, task tracker, job tracker. With the proposal to make that
remotely visible (https://issues.apache.org/jira/browse/HADOOP-3969),
you could hit every URL with an appropriately authenticated GET and see
if it is live.
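
As a rough illustration of that kind of probe against the existing daemon
web UIs -- plain GETs, no authentication, and assuming the 0.18-era default
ports -- something like the following would do:

    from urllib.request import urlopen
    from urllib.error import URLError

    # 0.18-era default web UI ports; adjust to your configuration.
    DAEMON_PORTS = {"namenode": 50070, "jobtracker": 50030,
                    "datanode": 50075, "tasktracker": 50060}

    def is_live(host, daemon, timeout=10):
        """Treat an HTTP 200 from the daemon's web UI as 'live'."""
        url = "http://%s:%d/" % (host, DAEMON_PORTS[daemon])
        try:
            return urlopen(url, timeout=timeout).getcode() == 200
        except (URLError, OSError):
            return False

    # e.g. is_live("ec2-master.example.com", "jobtracker")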



Another trick I like for long-haul health checks is to use Google Talk 
and give every machine a login underneath (you can have them all in your 
own domain via Google Apps). Then you can use XMPP to monitor system 
health. A more advanced variant is to use it as the command interface, 
which is very similar to how botnets work with IRC, though in that case 
the botnet herders are trying to manage 500,000 0wned boxes and don't 
want a reply.
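
For what it's worth, a hedged sketch of that heartbeat idea, assuming the
xmpppy library and placeholder accounts (each node gets its own JID in a
Google Apps domain):

    import xmpp

    def send_heartbeat(node_jid, password, ops_jid, body):
        """Log in as this node's account and send one message to the ops JID."""
        jid = xmpp.protocol.JID(node_jid)
        client = xmpp.Client(jid.getDomain(), debug=[])
        if not client.connect():
            return False
        if not client.auth(jid.getNode(), password, resource="monitor"):
            return False
        client.send(xmpp.protocol.Message(ops_jid, body))
        client.disconnect()
        return True

    # e.g. from cron on each node:
    # send_heartbeat("node01@example.com", "secret", "ops@example.com",
    #                "datanode + tasktracker alive")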


Re: Auto-shutdown for EC2 clusters

2008-10-24 Thread Paco NATHAN
Hi Karl,

Rather than using separate key pairs, you can use EC2 security groups
to keep track of different clusters.

Effectively, that requires a new security group for every cluster --
so just allocate a bunch of different ones in a config file, then have
the launch scripts draw from those. We also use EC2 static IP
addresses and then have a DNS entry named similarly to each security
group, associated with a static IP once that cluster is launched.
It's relatively simple to query the running instances and collect them
according to security groups.
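
A sketch of that query, assuming the boto library mentioned elsewhere in
this thread (the group attribute names differ slightly between boto
versions, and the "hadoop-" prefix is just a hypothetical naming
convention for cluster groups):

    from collections import defaultdict
    import boto.ec2

    conn = boto.ec2.connect_to_region("us-east-1")  # credentials from environment

    clusters = defaultdict(list)
    for reservation in conn.get_all_instances():
        names = [g.name for g in reservation.groups]  # groups used at launch
        for inst in reservation.instances:
            if inst.state != "running":
                continue
            for name in names:
                if name.startswith("hadoop-"):
                    clusters[name].append(inst.public_dns_name)

    for cluster, hosts in sorted(clusters.items()):
        print("%s: %d running nodes" % (cluster, len(hosts)))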

One way to handle detecting failures is just to attempt SSH in a loop.
Our rough estimate is that approximately 2% of the attempted EC2 nodes
fail at launch. So we allocate more than enough, given that rate.
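
A rough sketch of that SSH loop -- key path and user are placeholders --
which returns whichever hosts never came up, so the launcher can start
replacements:

    import subprocess
    import time

    def wait_for_ssh(hosts, key="/path/to/cluster-key.pem",
                     user="root", retries=10, delay=30):
        """Retry ssh against every host; return the ones that never answered."""
        pending = set(hosts)
        for _ in range(retries):
            for host in list(pending):
                cmd = ["ssh", "-i", key,
                       "-o", "BatchMode=yes", "-o", "ConnectTimeout=10",
                       "-o", "StrictHostKeyChecking=no",
                       "%s@%s" % (user, host), "true"]
                if subprocess.call(cmd, stdout=subprocess.DEVNULL,
                                   stderr=subprocess.DEVNULL) == 0:
                    pending.discard(host)
            if not pending:
                break
            time.sleep(delay)
        return pending  # still unreachable after all retries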

In a nutshell, that's one approach for managing a Hadoop cluster
remotely on EC2.

Best,
Paco


On Fri, Oct 24, 2008 at 2:07 PM, Karl Anderson <[EMAIL PROTECTED]> wrote:
>
> On 23-Oct-08, at 10:01 AM, Paco NATHAN wrote:
>>
>> This workflow could be initiated from a crontab -- totally automated.
>> However, we still see occasional failures of the cluster, and must
>> restart manually, but not often.  Stability for that has improved much
>> since the 0.18 release.  For us, it's getting closer to total
>> automation.
>>
>> FWIW, that's running on EC2 m1.xl instances.
>
> Same here.  I've always had the namenode and web interface be accessible,
> but sometimes I don't get the slave nodes - usually zero slaves when this
> happens, sometimes I only miss one or two.  My rough estimate is that this
> happens 1% of the time.
>
> I currently have to notice this and restart manually.  Do you have a good
> way to detect it?  I have several Hadoop clusters running at once with the
> same AWS image and SSH keypair, so I can't count running instances.  I could
> have a separate keypair per cluster and count instances with that keypair,
> but I'd like to be able to start clusters opportunistically, with more than
> one cluster doing the same kind of job on different data.
>
>
> Karl Anderson
> [EMAIL PROTECTED]
> http://monkey.org/~kra
>
>
>
>


Re: Auto-shutdown for EC2 clusters

2008-10-24 Thread Karl Anderson


On 23-Oct-08, at 10:01 AM, Paco NATHAN wrote:


> This workflow could be initiated from a crontab -- totally automated.
> However, we still see occasional failures of the cluster, and must
> restart manually, but not often.  Stability for that has improved much
> since the 0.18 release.  For us, it's getting closer to total
> automation.
>
> FWIW, that's running on EC2 m1.xl instances.

Same here.  I've always had the namenode and web interface be accessible,
but sometimes I don't get the slave nodes - usually zero slaves when this
happens, sometimes I only miss one or two.  My rough estimate is that this
happens 1% of the time.

I currently have to notice this and restart manually.  Do you have a good
way to detect it?  I have several Hadoop clusters running at once with the
same AWS image and SSH keypair, so I can't count running instances.  I could
have a separate keypair per cluster and count instances with that keypair,
but I'd like to be able to start clusters opportunistically, with more than
one cluster doing the same kind of job on different data.



Karl Anderson
[EMAIL PROTECTED]
http://monkey.org/~kra





Re: Auto-shutdown for EC2 clusters

2008-10-24 Thread Steve Loughran

Paco NATHAN wrote:

> What seems to be emerging here is a pattern for another special node
> associated with a Hadoop cluster.
>
> The need is to have a machine which can:
>    * handle setup and shutdown of a Hadoop cluster on remote server resources
>    * manage loading and retrieving data via a storage grid
>    * interact and synchronize events via a message broker
>    * capture telemetry (logging, exceptions, job/task stats) from the
> remote cluster
>
> On our team, one of the engineers named it a CloudController, as
> distinct from JobTracker and NameNode.
>
> In the discussion here, the CloudController pattern derives from
> services provided by AWS. However, it could just as easily be mapped
> to other elastic services for servers / storage buckets / message
> queues -- based on other vendors, company data centers, etc.


It really depends on how the other infrastructures allocate their 
machines. FWIW, the EC2 APIs, while simple, are fairly limited: you don't 
get to spec out your topology, or hint at the data you want to work with.




> This pattern has come up in several other discussions I've had with
> other companies making large use of Hadoop. We're generally trying to
> address these issues:
>    * long-running batch jobs and how to manage complex workflows for them
>    * managing trade-offs between using a cloud provider (AWS,
> Flexiscale, AppNexus, etc.) and using company data centers
>    * managing trade-offs between cluster size vs. batch window time
> vs. total cost
>
> Our team chose to implement this functionality using Python scripts --
> replacing the shell scripts. That makes it easier to handle the many
> potential exceptions of leasing remote elastic resources.  FWIW, our
> team is also moving these CloudController scripts to run under
> RightScale, to manage AWS resources more effectively -- especially the
> logging after node failures.
>
> What do you think of having an optional CloudController added to the
> definition of a Hadoop cluster?


It's very much a management problem.

If you look at https://issues.apache.org/jira/browse/HADOOP-3628
you can see the Hadoop services slowly getting the ability to be 
started/stopped more easily; then we need to work on the configuration 
to remove the need to push out XML files to every node to change behaviour.


As a result, we can bring up clusters with different configurations on 
existing, VMware-allocated, or remote (EC2) farms. I say that with the 
caveat that I haven't been playing with EC2 recently, on account of 
having more local machines to hand, including a new laptop with enough 
cores to act like its own mini-cluster.


slideware:
http://people.apache.org/~stevel/slides/deploying_on_ec2.pdf
http://people.apache.org/~stevel/slides/deploying_hadoop_with_smartfrog.pdf

I'm putting in for an ApacheCon EU talk on the topic, "Dynamic Hadoop 
Clusters".


--
Steve Loughran  http://www.1060.org/blogxter/publish/5
Author: Ant in Action   http://antbook.org/


Re: Auto-shutdown for EC2 clusters

2008-10-24 Thread Paco NATHAN
What seems to be emerging here is a pattern for another special node
associated with a Hadoop cluster.

The need is to have a machine which can:
   * handle setup and shutdown of a Hadoop cluster on remote server resources
   * manage loading and retrieving data via a storage grid
   * interact and synchronize events via a message broker
   * capture telemetry (logging, exceptions, job/task stats) from the
remote cluster

On our team, one of the engineers named it a CloudController, as
distinct from JobTracker and NameNode.
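
To make the shape of that pattern concrete, a purely hypothetical skeleton
(not Paco's team's code) might look like the following, with the method
bodies left to wrap boto or another provider's API plus the data moves and
queue polling discussed elsewhere in this thread:

    class CloudController(object):
        """Hypothetical sketch of the 'special node' described above."""

        def __init__(self, cluster_name, config):
            self.cluster_name = cluster_name  # doubles as the EC2 security group
            self.config = config

        def launch_cluster(self, num_slaves):
            """Start master + slaves, wait until Hadoop accepts jobs."""
            raise NotImplementedError

        def load_data(self, src_s3_path, dest_hdfs_path):
            """Stage input from the storage grid (e.g. distcp from S3)."""
            raise NotImplementedError

        def run_workflow(self, jobs):
            """Submit the MR jobs; the last one posts a 'done' message to a queue."""
            raise NotImplementedError

        def wait_for_completion(self):
            """Block on the message broker (SQS, RabbitMQ, ...) for that message."""
            raise NotImplementedError

        def collect_telemetry(self, dest_dir):
            """Pull logs, counters, and job/task stats off the cluster."""
            raise NotImplementedError

        def shutdown_cluster(self):
            """Terminate all instances in this cluster's security group."""
            raise NotImplementedError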

In the discussion here, the CloudController pattern derives from
services provided by AWS. However, it could just as easily be mapped
to other elastic services for servers / storage buckets / message
queues -- based on other vendors, company data centers, etc.

This pattern has come up in several other discussions I've had with
other companies making large use of Hadoop. We're generally trying to
address these issues:
   * long-running batch jobs and how to manage complex workflows for them
   * managing trade-offs between using a cloud provider (AWS,
Flexiscale, AppNexus, etc.) and using company data centers
   * managing trade-offs between cluster size vs. batch window time
vs. total cost

Our team chose to implement this functionality using Python scripts --
replacing the shell scripts. That makes it easier to handle the many
potential exceptions of leasing remote elastic resources.  FWIW, our
team is also moving these CloudController scripts to run under
RightScale, to manage AWS resources more effectively -- especially the
logging after node failures.

What do you think of having an optional CloudController added to the
definition of a Hadoop cluster?

Paco

On Thu, Oct 23, 2008 at 10:52 AM, Chris K Wensel <[EMAIL PROTECTED]> wrote:
> Hey Stuart
>
> I did that for a client using Cascading events and SQS.
>
> When jobs completed, they dropped a message on SQS where a listener picked
> up new jobs and ran with them, or decided to kill off the cluster. The
> currently shipping EC2 scripts are suitable for having multiple simultaneous
> clusters for this purpose.
>
> Cascading has always supported, and now Hadoop supports (thanks Tom), raw
> file access on S3, so this is quite natural. This is the best approach, as
> data is pulled directly into the Mapper instead of onto HDFS first and then
> read into the Mapper from HDFS.
>
> YMMV
>
> chris
>
> On Oct 23, 2008, at 7:47 AM, Stuart Sierra wrote:
>
>> Hi folks,
>> Anybody tried scripting Hadoop on EC2 to...
>> 1. Launch a cluster
>> 2. Pull data from S3
>> 3. Run a job
>> 4. Copy results to S3
>> 5. Terminate the cluster
>> ... without any user interaction?
>>
>> -Stuart
>
> --
> Chris K Wensel
> [EMAIL PROTECTED]
> http://chris.wensel.net/
> http://www.cascading.org/
>
>


Re: Auto-shutdown for EC2 clusters

2008-10-23 Thread Paco NATHAN
Hi Stuart,

Yes, we do that.  Ditto on most of what Chris described.

We use an AMI which pulls tarballs for Ant, Java, Hadoop, etc., from
S3 when it launches. That controls the versions for tools/frameworks,
instead of redoing an AMI each time a tool has an update.
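
A sketch of that boot-time fetch, assuming boto and placeholder bucket/key
names (in practice they would come from instance user-data or a small
config baked into the AMI):

    import os
    import tarfile
    from boto.s3.connection import S3Connection

    conn = S3Connection()  # reads AWS credentials from the environment
    bucket = conn.get_bucket("my-bootstrap-bucket")

    for key_name, dest in [("tarballs/hadoop-0.18.3.tar.gz", "/usr/local"),
                           ("tarballs/apache-ant-1.7.1.tar.gz", "/usr/local")]:
        local_path = os.path.join("/tmp", os.path.basename(key_name))
        bucket.get_key(key_name).get_contents_to_filename(local_path)
        with tarfile.open(local_path, "r:gz") as tar:
            tar.extractall(dest)   # unpack the pinned tool version at boot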

A remote server -- in our data center -- acts as a controller to launch
and manage the cluster.  FWIW, an engineer here wrote those scripts in
Python using "boto".  We had to patch boto; that patch has been
submitted upstream.

The mapper for the first MR job in the workflow streams in data from
S3.  Reducers in subsequent jobs have the option to write output to S3
(as Chris mentioned).

After the last MR job in the workflow completes, it pushes a message
into SQS.  The remote server polls SQS, then performs a shutdown of
the cluster.
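
A sketch of that controller loop, assuming boto and placeholder queue and
security-group names: poll SQS for the workflow's "done" message, then
terminate every running instance in the cluster's group:

    import time
    import boto.sqs
    import boto.ec2

    def wait_and_shutdown(queue_name="hadoop-workflow-done",
                          group_name="hadoop-cluster-01",
                          region="us-east-1", poll_every=60):
        sqs = boto.sqs.connect_to_region(region)
        queue = sqs.get_queue(queue_name)
        while True:
            msg = queue.read()
            if msg is not None:
                queue.delete_message(msg)   # workflow reported completion
                break
            time.sleep(poll_every)          # SQS is polled, not pushed

        ec2 = boto.ec2.connect_to_region(region)
        ids = [i.id for r in ec2.get_all_instances()
               if group_name in [g.name for g in r.groups]
               for i in r.instances if i.state == "running"]
        if ids:
            ec2.terminate_instances(ids)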

We may replace our use of SQS with RabbitMQ -- it would be more flexible
for brokering other kinds of messages between Hadoop on AWS and the
controller/consumer of results back in our data center.


This workflow could be initiated from a crontab -- totally automated.
However, we still see occasional failures of the cluster, and must
restart manually, but not often.  Stability for that has improved much
since the 0.18 release.  For us, it's getting closer to total
automation.

FWIW, that's running on EC2 m1.xl instances.

Paco



On Thu, Oct 23, 2008 at 9:47 AM, Stuart Sierra <[EMAIL PROTECTED]> wrote:
> Hi folks,
> Anybody tried scripting Hadoop on EC2 to...
> 1. Launch a cluster
> 2. Pull data from S3
> 3. Run a job
> 4. Copy results to S3
> 5. Terminate the cluster
> ... without any user interaction?
>
> -Stuart
>


Re: Auto-shutdown for EC2 clusters

2008-10-23 Thread Per Jacobsson
We're doing the same thing, but handling the scheduling with shell scripts
running on a machine outside of the Hadoop cluster. It works, but we're
getting into a bit of scripting hell as things get more complex.

We're using distcp to first copy the files the jobs need from S3 to HDFS, and
it works nicely. When going the other direction, we have to pull the data
down from HDFS to one of the EC2 machines and then push it back up to S3. If
I understand things right, the support for that will be better in 0.19.
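
A sketch of both data moves under those constraints, assuming the s3n://
filesystem, boto for the upload, and placeholder bucket names and paths:

    import os
    import subprocess
    from boto.s3.connection import S3Connection

    # Inbound: distcp straight from S3 into HDFS (S3 credentials configured
    # in hadoop-site.xml).
    subprocess.check_call(
        ["hadoop", "distcp", "s3n://my-input-bucket/data", "/user/hadoop/input"])

    # Outbound (pre-0.19 workaround): copy out of HDFS to local disk, then
    # push each part file up to S3 with boto.
    subprocess.check_call(
        ["hadoop", "fs", "-get", "/user/hadoop/output", "/mnt/output"])

    bucket = S3Connection().get_bucket("my-output-bucket")
    for name in os.listdir("/mnt/output"):
        key = bucket.new_key("results/" + name)
        key.set_contents_from_filename(os.path.join("/mnt/output", name))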
/ Per

On Thu, Oct 23, 2008 at 8:52 AM, Chris K Wensel <[EMAIL PROTECTED]> wrote:

> Hey Stuart
>
> I did that for a client using Cascading events and SQS.
>
> When jobs completed, they dropped a message on SQS where a listener picked
> up new jobs and ran with them, or decided to kill off the cluster. The
> currently shipping EC2 scripts are suitable for having multiple simultaneous
> clusters for this purpose.
>
> Cascading has always supported, and now Hadoop supports (thanks Tom), raw
> file access on S3, so this is quite natural. This is the best approach, as
> data is pulled directly into the Mapper instead of onto HDFS first and then
> read into the Mapper from HDFS.
>
> YMMV
>
> chris
>
>
> On Oct 23, 2008, at 7:47 AM, Stuart Sierra wrote:
>
>  Hi folks,
>> Anybody tried scripting Hadoop on EC2 to...
>> 1. Launch a cluster
>> 2. Pull data from S3
>> 3. Run a job
>> 4. Copy results to S3
>> 5. Terminate the cluster
>> ... without any user interaction?
>>
>> -Stuart
>>
>
> --
> Chris K Wensel
> [EMAIL PROTECTED]
> http://chris.wensel.net/
> http://www.cascading.org/
>
>


Re: Auto-shutdown for EC2 clusters

2008-10-23 Thread Chris K Wensel

Hey Stuart

I did that for a client using Cascading events and SQS.

When jobs completed, they dropped a message on SQS where a listener
picked up new jobs and ran with them, or decided to kill off the
cluster. The currently shipping EC2 scripts are suitable for having
multiple simultaneous clusters for this purpose.
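
The producer side of that pattern is small; a sketch assuming boto and a
placeholder queue name (in Chris's setup the trigger is a Cascading event,
but any end-of-job hook can call it):

    import json
    import boto.sqs
    from boto.sqs.message import Message

    def notify_done(cluster_name, status="ok",
                    queue_name="hadoop-workflow-done"):
        """Drop a small completion message on SQS for the listener to pick up."""
        conn = boto.sqs.connect_to_region("us-east-1")
        queue = conn.get_queue(queue_name) or conn.create_queue(queue_name)
        msg = Message()
        msg.set_body(json.dumps({"cluster": cluster_name, "status": status}))
        queue.write(msg)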


Cascading has always supported, and now Hadoop supports (thanks Tom), raw
file access on S3, so this is quite natural. This is the best approach, as
data is pulled directly into the Mapper instead of onto HDFS first and
then read into the Mapper from HDFS.


YMMV

chris

On Oct 23, 2008, at 7:47 AM, Stuart Sierra wrote:


> Hi folks,
> Anybody tried scripting Hadoop on EC2 to...
> 1. Launch a cluster
> 2. Pull data from S3
> 3. Run a job
> 4. Copy results to S3
> 5. Terminate the cluster
> ... without any user interaction?
>
> -Stuart


--
Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/



Auto-shutdown for EC2 clusters

2008-10-23 Thread Stuart Sierra
Hi folks,
Anybody tried scripting Hadoop on EC2 to...
1. Launch a cluster
2. Pull data from S3
3. Run a job
4. Copy results to S3
5. Terminate the cluster
... without any user interaction?

-Stuart