Re: Auto-shutdown for EC2 clusters
I've just created a basic script to do something similar for running a benchmark on EC2. See https://issues.apache.org/jira/browse/HADOOP-4382. As it stands the code for detecting when Hadoop is ready to accept jobs is simplistic, to say the least, so any ideas for improvement would be great.

Thanks,
Tom

On Fri, Oct 24, 2008 at 11:53 PM, Chris K Wensel <[EMAIL PROTECTED]> wrote:
> fyi, the src/contrib/ec2 scripts do just what Paco suggests.
>
> minus the static IP stuff (you can use the scripts to login via cluster name, and spawn a tunnel for browsing nodes)
>
> that is, you can spawn any number of uniquely named, configured, and sized clusters, and you can increase their size independently as well. (shrinking is another matter altogether)
>
> ckw
>
> On Oct 24, 2008, at 1:58 PM, Paco NATHAN wrote:
>
>> Hi Karl,
>>
>> Rather than using separate key pairs, you can use EC2 security groups to keep track of different clusters.
>>
>> Effectively, that requires a new security group for every cluster -- so just allocate a bunch of different ones in a config file, then have the launch scripts draw from those. We also use EC2 static IP addresses and then have a DNS entry named similarly to each security group, associated with a static IP once that cluster is launched. It's relatively simple to query the running instances and collect them according to security groups.
>>
>> One way to handle detecting failures is just to attempt SSH in a loop. Our rough estimate is that approximately 2% of the attempted EC2 nodes fail at launch. So we allocate more than enough, given that rate.
>>
>> In a nutshell, that's one approach for managing a Hadoop cluster remotely on EC2.
>>
>> Best,
>> Paco
>>
>> On Fri, Oct 24, 2008 at 2:07 PM, Karl Anderson <[EMAIL PROTECTED]> wrote:
>>>
>>> On 23-Oct-08, at 10:01 AM, Paco NATHAN wrote:
>>>> This workflow could be initiated from a crontab -- totally automated. However, we still see occasional failures of the cluster, and must restart manually, but not often. Stability for that has improved much since the 0.18 release. For us, it's getting closer to total automation.
>>>>
>>>> FWIW, that's running on EC2 m1.xl instances.
>>>
>>> Same here. I've always had the namenode and web interface be accessible, but sometimes I don't get the slave nodes - usually zero slaves when this happens, sometimes I only miss one or two. My rough estimate is that this happens 1% of the time.
>>>
>>> I currently have to notice this and restart manually. Do you have a good way to detect it? I have several Hadoop clusters running at once with the same AWS image and SSH keypair, so I can't count running instances. I could have a separate keypair per cluster and count instances with that keypair, but I'd like to be able to start clusters opportunistically, with more than one cluster doing the same kind of job on different data.
>>>
>>> Karl Anderson
>>> [EMAIL PROTECTED]
>>> http://monkey.org/~kra
>
> --
> Chris K Wensel
> [EMAIL PROTECTED]
> http://chris.wensel.net/
> http://www.cascading.org/
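For anyone poking at the detection problem Tom mentions, here is a minimal Python sketch of one way to poll a cluster until it looks ready to accept jobs. It is not the HADOOP-4382 script itself; the hostname and timeouts are placeholders, and it assumes the default web UI ports (50070 for the namenode, 50030 for the jobtracker) plus a local `hadoop` client already configured to talk to the cluster.

    import socket
    import subprocess
    import time
    import urllib2

    def wait_until_ready(master_host, timeout_secs=900, poll_secs=15):
        """Poll the NameNode and JobTracker web UIs until both answer,
        then confirm the JobTracker will answer a job listing."""
        socket.setdefaulttimeout(10)
        urls = ["http://%s:50070/" % master_host,  # NameNode web UI (default port)
                "http://%s:50030/" % master_host]  # JobTracker web UI (default port)
        deadline = time.time() + timeout_secs
        while time.time() < deadline:
            try:
                for url in urls:
                    urllib2.urlopen(url).read()
                # Web UIs are up; make sure the JobTracker answers a client call too.
                if subprocess.call(["hadoop", "job", "-list"]) == 0:
                    return True
            except Exception:
                pass
            time.sleep(poll_secs)
        return False

A stricter check could also scrape the jobtracker page for the number of live task trackers and keep waiting until it matches the number of slaves requested.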
Re: Auto-shutdown for EC2 clusters
fyi, the src/contrib/ec2 scripts do just what Paco suggests.

minus the static IP stuff (you can use the scripts to login via cluster name, and spawn a tunnel for browsing nodes)

that is, you can spawn any number of uniquely named, configured, and sized clusters, and you can increase their size independently as well. (shrinking is another matter altogether)

ckw

On Oct 24, 2008, at 1:58 PM, Paco NATHAN wrote:
> Hi Karl,
>
> Rather than using separate key pairs, you can use EC2 security groups to keep track of different clusters.
>
> Effectively, that requires a new security group for every cluster -- so just allocate a bunch of different ones in a config file, then have the launch scripts draw from those. We also use EC2 static IP addresses and then have a DNS entry named similarly to each security group, associated with a static IP once that cluster is launched. It's relatively simple to query the running instances and collect them according to security groups.
>
> One way to handle detecting failures is just to attempt SSH in a loop. Our rough estimate is that approximately 2% of the attempted EC2 nodes fail at launch. So we allocate more than enough, given that rate.
>
> In a nutshell, that's one approach for managing a Hadoop cluster remotely on EC2.
>
> Best,
> Paco
>
> On Fri, Oct 24, 2008 at 2:07 PM, Karl Anderson <[EMAIL PROTECTED]> wrote:
>> On 23-Oct-08, at 10:01 AM, Paco NATHAN wrote:
>>> This workflow could be initiated from a crontab -- totally automated. However, we still see occasional failures of the cluster, and must restart manually, but not often. Stability for that has improved much since the 0.18 release. For us, it's getting closer to total automation.
>>>
>>> FWIW, that's running on EC2 m1.xl instances.
>>
>> Same here. I've always had the namenode and web interface be accessible, but sometimes I don't get the slave nodes - usually zero slaves when this happens, sometimes I only miss one or two. My rough estimate is that this happens 1% of the time.
>>
>> I currently have to notice this and restart manually. Do you have a good way to detect it? I have several Hadoop clusters running at once with the same AWS image and SSH keypair, so I can't count running instances. I could have a separate keypair per cluster and count instances with that keypair, but I'd like to be able to start clusters opportunistically, with more than one cluster doing the same kind of job on different data.
>>
>> Karl Anderson
>> [EMAIL PROTECTED]
>> http://monkey.org/~kra

--
Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/
Re: Auto-shutdown for EC2 clusters
Paco NATHAN wrote:
> Hi Karl,
>
> Rather than using separate key pairs, you can use EC2 security groups to keep track of different clusters.
>
> Effectively, that requires a new security group for every cluster -- so just allocate a bunch of different ones in a config file, then have the launch scripts draw from those. We also use EC2 static IP addresses and then have a DNS entry named similarly to each security group, associated with a static IP once that cluster is launched. It's relatively simple to query the running instances and collect them according to security groups.
>
> One way to handle detecting failures is just to attempt SSH in a loop. Our rough estimate is that approximately 2% of the attempted EC2 nodes fail at launch. So we allocate more than enough, given that rate.

We have a patch to add a ping() operation to all the services -- namenode, datanode, task tracker, job tracker. With a proposal to make that remotely visible, https://issues.apache.org/jira/browse/HADOOP-3969, you could hit every URL with an appropriately authenticated GET and see if it is live.

Another trick I like for long-haul health checks is to use Google Talk and give every machine a login underneath (you can have them all in your own domain via Google Apps). Then you can use XMPP to monitor system health. A more advanced variant is to use it as the command interface, which is very similar to how botnets work with IRC, though in that case the botnet herders are trying to manage 500,000 0wned boxes and don't want a reply.
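Pending something like HADOOP-3969, a rough approximation of that check is simply to GET each daemon's status page. A hedged Python sketch, assuming the default web UI ports of that era (namenode 50070, datanode 50075, jobtracker 50030, tasktracker 50060) and no authentication in front of them; host names and the master/slave split are placeholders:

    import socket
    import urllib2

    # Default HTTP status ports for each daemon (hypothetical layout:
    # master runs namenode + jobtracker, each slave runs datanode + tasktracker).
    MASTER_PORTS = {"namenode": 50070, "jobtracker": 50030}
    SLAVE_PORTS = {"datanode": 50075, "tasktracker": 50060}

    def probe(host, port):
        """Return True if the daemon's web UI on host:port answers a GET."""
        socket.setdefaulttimeout(10)
        try:
            urllib2.urlopen("http://%s:%d/" % (host, port)).read()
            return True
        except Exception:
            return False

    def cluster_health(master, slaves):
        """Map (host, daemon) -> liveness for every daemon we expect."""
        report = {}
        for name, port in MASTER_PORTS.items():
            report[(master, name)] = probe(master, port)
        for slave in slaves:
            for name, port in SLAVE_PORTS.items():
                report[(slave, name)] = probe(slave, port)
        return report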
Re: Auto-shutdown for EC2 clusters
Hi Karl,

Rather than using separate key pairs, you can use EC2 security groups to keep track of different clusters.

Effectively, that requires a new security group for every cluster -- so just allocate a bunch of different ones in a config file, then have the launch scripts draw from those. We also use EC2 static IP addresses and then have a DNS entry named similarly to each security group, associated with a static IP once that cluster is launched. It's relatively simple to query the running instances and collect them according to security groups.

One way to handle detecting failures is just to attempt SSH in a loop. Our rough estimate is that approximately 2% of the attempted EC2 nodes fail at launch. So we allocate more than enough, given that rate.

In a nutshell, that's one approach for managing a Hadoop cluster remotely on EC2.

Best,
Paco

On Fri, Oct 24, 2008 at 2:07 PM, Karl Anderson <[EMAIL PROTECTED]> wrote:
> On 23-Oct-08, at 10:01 AM, Paco NATHAN wrote:
>> This workflow could be initiated from a crontab -- totally automated. However, we still see occasional failures of the cluster, and must restart manually, but not often. Stability for that has improved much since the 0.18 release. For us, it's getting closer to total automation.
>>
>> FWIW, that's running on EC2 m1.xl instances.
>
> Same here. I've always had the namenode and web interface be accessible, but sometimes I don't get the slave nodes - usually zero slaves when this happens, sometimes I only miss one or two. My rough estimate is that this happens 1% of the time.
>
> I currently have to notice this and restart manually. Do you have a good way to detect it? I have several Hadoop clusters running at once with the same AWS image and SSH keypair, so I can't count running instances. I could have a separate keypair per cluster and count instances with that keypair, but I'd like to be able to start clusters opportunistically, with more than one cluster doing the same kind of job on different data.
>
> Karl Anderson
> [EMAIL PROTECTED]
> http://monkey.org/~kra
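A rough Python sketch of the two pieces Paco describes -- listing the running instances that carry a cluster's security group via boto, and SSHing to each in a loop to spot failed launches. This is not Paco's script: the group name, key path, and retry counts are placeholders, and the group-name lookup is hedged because older boto versions expose it as `id` rather than `name`.

    import subprocess
    from boto.ec2.connection import EC2Connection

    def instances_in_group(conn, group_name):
        """Return running instances whose reservation carries the given security group."""
        found = []
        for reservation in conn.get_all_instances():
            groups = [getattr(g, "name", None) or getattr(g, "id", None)
                      for g in reservation.groups]
            if group_name in groups:
                found.extend(i for i in reservation.instances
                             if i.state == "running")
        return found

    def ssh_alive(host, key_path, attempts=5):
        """Try SSH in a loop; a node that never answers is treated as a failed launch."""
        for _ in range(attempts):
            rc = subprocess.call(
                ["ssh", "-i", key_path,
                 "-o", "BatchMode=yes", "-o", "ConnectTimeout=10",
                 "-o", "StrictHostKeyChecking=no",
                 "root@%s" % host, "true"])
            if rc == 0:
                return True
        return False

    if __name__ == "__main__":
        conn = EC2Connection()  # reads AWS credentials from the environment
        for inst in instances_in_group(conn, "hadoop-cluster-01"):  # placeholder group name
            ok = ssh_alive(inst.dns_name, "/path/to/keypair.pem")    # placeholder key path
            print inst.id, inst.dns_name, ("up" if ok else "NO SSH")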
Re: Auto-shutdown for EC2 clusters
On 23-Oct-08, at 10:01 AM, Paco NATHAN wrote:
> This workflow could be initiated from a crontab -- totally automated. However, we still see occasional failures of the cluster, and must restart manually, but not often. Stability for that has improved much since the 0.18 release. For us, it's getting closer to total automation.
>
> FWIW, that's running on EC2 m1.xl instances.

Same here. I've always had the namenode and web interface be accessible, but sometimes I don't get the slave nodes - usually zero slaves when this happens, sometimes I only miss one or two. My rough estimate is that this happens 1% of the time.

I currently have to notice this and restart manually. Do you have a good way to detect it? I have several Hadoop clusters running at once with the same AWS image and SSH keypair, so I can't count running instances. I could have a separate keypair per cluster and count instances with that keypair, but I'd like to be able to start clusters opportunistically, with more than one cluster doing the same kind of job on different data.

Karl Anderson
[EMAIL PROTECTED]
http://monkey.org/~kra
Re: Auto-shutdown for EC2 clusters
Paco NATHAN wrote:
> What seems to be emerging here is a pattern for another special node associated with a Hadoop cluster. The need is to have a machine which can:
>
> * handle setup and shutdown of a Hadoop cluster on remote server resources
> * manage loading and retrieving data via a storage grid
> * interact and synchronize events via a message broker
> * capture telemetry (logging, exceptions, job/task stats) from the remote cluster
>
> On our team, one of the engineers named it a CloudController, as distinct from JobTracker and NameNode. In the discussion here, the CloudController pattern derives from services provided by AWS. However, it could just as easily be mapped to other elastic services for servers / storage buckets / message queues -- based on other vendors, company data centers, etc.

It really depends on how the other infrastructures allocate their machines. FWIW, the EC2 APIs, while simple, are fairly limited: you don't get to spec out your topology, or hint at the data you want to work with.

> This pattern has come up in several other discussions I've had with other companies making large use of Hadoop. We're generally trying to address these issues:
>
> * long-running batch jobs and how to manage complex workflows for them
> * managing trade-offs between using a cloud provider (AWS, Flexiscale, AppNexus, etc.) and using company data centers
> * managing trade-offs between cluster size vs. batch window time vs. total cost
>
> Our team chose to implement this functionality using Python scripts -- replacing the shell scripts. That makes it easier to handle the many potential exceptions of leasing remote elastic resources. FWIW, our team is also moving these CloudController scripts to run under RightScale, to manage AWS resources more effectively -- especially the logging after node failures.
>
> What do you think of having an optional CloudController added to the definition of a Hadoop cluster?

It's very much a management problem. If you look at https://issues.apache.org/jira/browse/HADOOP-3628 you can see the Hadoop services slowly getting the ability to be started/stopped more easily; then we need to work on the configuration to remove the need to push out XML files to every node to change behaviour. As a result, we can bring up clusters with different configurations on existing, VMware-allocated, or remote (EC2) farms.

I say that with the caveat that I haven't been playing with EC2 recently, on account of having more local machines to hand, including a new laptop with enough cores to act like its own mini-cluster.

Slideware:
http://people.apache.org/~stevel/slides/deploying_on_ec2.pdf
http://people.apache.org/~stevel/slides/deploying_hadoop_with_smartfrog.pdf

I'm putting in for an ApacheCon EU talk on the topic, "Dynamic Hadoop Clusters".

--
Steve Loughran
http://www.1060.org/blogxter/publish/5
Author: Ant in Action http://antbook.org/
Re: Auto-shutdown for EC2 clusters
What seems to be emerging here is a pattern for another special node associated with a Hadoop cluster. The need is to have a machine which can:

* handle setup and shutdown of a Hadoop cluster on remote server resources
* manage loading and retrieving data via a storage grid
* interact and synchronize events via a message broker
* capture telemetry (logging, exceptions, job/task stats) from the remote cluster

On our team, one of the engineers named it a CloudController, as distinct from JobTracker and NameNode. In the discussion here, the CloudController pattern derives from services provided by AWS. However, it could just as easily be mapped to other elastic services for servers / storage buckets / message queues -- based on other vendors, company data centers, etc.

This pattern has come up in several other discussions I've had with other companies making large use of Hadoop. We're generally trying to address these issues:

* long-running batch jobs and how to manage complex workflows for them
* managing trade-offs between using a cloud provider (AWS, Flexiscale, AppNexus, etc.) and using company data centers
* managing trade-offs between cluster size vs. batch window time vs. total cost

Our team chose to implement this functionality using Python scripts -- replacing the shell scripts. That makes it easier to handle the many potential exceptions of leasing remote elastic resources. FWIW, our team is also moving these CloudController scripts to run under RightScale, to manage AWS resources more effectively -- especially the logging after node failures.

What do you think of having an optional CloudController added to the definition of a Hadoop cluster?

Paco

On Thu, Oct 23, 2008 at 10:52 AM, Chris K Wensel <[EMAIL PROTECTED]> wrote:
> Hey Stuart
>
> I did that for a client using Cascading events and SQS.
>
> When jobs completed, they dropped a message on SQS where a listener picked up new jobs and ran with them, or decided to kill off the cluster. The currently shipping EC2 scripts are suitable for having multiple simultaneous clusters for this purpose.
>
> Cascading has always and now Hadoop supports (thanks Tom) raw file access on S3, so this is quite natural. This is the best approach as data is pulled directly into the Mapper, instead of onto HDFS first, then read into the Mapper from HDFS.
>
> YMMV
>
> chris
>
> On Oct 23, 2008, at 7:47 AM, Stuart Sierra wrote:
>
>> Hi folks,
>> Anybody tried scripting Hadoop on EC2 to...
>> 1. Launch a cluster
>> 2. Pull data from S3
>> 3. Run a job
>> 4. Copy results to S3
>> 5. Terminate the cluster
>> ... without any user interaction?
>>
>> -Stuart
>
> --
> Chris K Wensel
> [EMAIL PROTECTED]
> http://chris.wensel.net/
> http://www.cascading.org/
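As a sketch of the shape such a node might take -- not any particular team's implementation -- here is a skeletal Python CloudController with one method per capability in Paco's list. Every class, method, and parameter name is hypothetical; the bodies are left as stubs to be filled in with boto, SQS/AMQP, and logging calls as appropriate.

    class CloudController(object):
        """Hypothetical controller node sitting outside the Hadoop cluster.

        It owns the cluster lifecycle, data staging, messaging, and telemetry,
        as distinct from the JobTracker and NameNode inside the cluster."""

        def __init__(self, cluster_name, config):
            self.cluster_name = cluster_name   # e.g. maps to an EC2 security group
            self.config = config               # AMI ids, instance counts, queue names, ...

        def launch_cluster(self):
            """Handle setup of a Hadoop cluster on remote server resources."""
            raise NotImplementedError

        def shutdown_cluster(self):
            """Tear the cluster down once the workflow reports completion."""
            raise NotImplementedError

        def stage_input(self, src, dest):
            """Manage loading data from a storage grid (e.g. S3) into the cluster."""
            raise NotImplementedError

        def retrieve_output(self, src, dest):
            """Pull results back out of the cluster to the storage grid."""
            raise NotImplementedError

        def poll_events(self):
            """Interact and synchronize events via a message broker (SQS, AMQP, ...)."""
            raise NotImplementedError

        def collect_telemetry(self):
            """Capture logging, exceptions, and job/task stats from the remote cluster."""
            raise NotImplementedError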
Re: Auto-shutdown for EC2 clusters
Hi Stuart,

Yes, we do that. Ditto on most of what Chris described.

We use an AMI which pulls tarballs for Ant, Java, Hadoop, etc., from S3 when it launches. That controls the versions for tools/frameworks, instead of redoing an AMI each time a tool has an update.

A remote server -- in our data center -- acts as a controller, to launch and manage the cluster. FWIW, an engineer here wrote those scripts in Python using "boto". Had to patch boto, which was submitted.

The mapper for the first MR job in the workflow streams in data from S3. Reducers in subsequent jobs have the option to write output to S3 (as Chris mentioned). After the last MR job in the workflow completes, it pushes a message into SQS. The remote server polls SQS, then performs a shutdown of the cluster.

We may replace use of SQS with RabbitMQ -- more flexible to broker other kinds of messages between Hadoop on AWS and the controller/consumer of results back in our data center.

This workflow could be initiated from a crontab -- totally automated. However, we still see occasional failures of the cluster, and must restart manually, but not often. Stability for that has improved much since the 0.18 release. For us, it's getting closer to total automation.

FWIW, that's running on EC2 m1.xl instances.

Paco

On Thu, Oct 23, 2008 at 9:47 AM, Stuart Sierra <[EMAIL PROTECTED]> wrote:
> Hi folks,
> Anybody tried scripting Hadoop on EC2 to...
> 1. Launch a cluster
> 2. Pull data from S3
> 3. Run a job
> 4. Copy results to S3
> 5. Terminate the cluster
> ... without any user interaction?
>
> -Stuart
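A hedged Python/boto sketch of the controller side of that loop -- poll an SQS queue for a completion message, then terminate every instance in the cluster's security group. The queue name, group name, and message format are placeholders, not Paco's actual scripts.

    import time
    from boto.ec2.connection import EC2Connection
    from boto.sqs.connection import SQSConnection

    QUEUE_NAME = "hadoop-workflow-done"      # placeholder queue name
    CLUSTER_GROUP = "hadoop-cluster-01"      # placeholder security group name

    def wait_for_completion(poll_secs=60):
        """Block until the workflow drops a message on the SQS queue."""
        sqs = SQSConnection()                # credentials from the environment
        queue = sqs.create_queue(QUEUE_NAME) # returns the existing queue if present
        while True:
            messages = queue.get_messages(1)
            if messages:
                body = messages[0].get_body()
                queue.delete_message(messages[0])
                return body
            time.sleep(poll_secs)

    def shutdown_cluster():
        """Terminate every running instance carrying the cluster's security group."""
        ec2 = EC2Connection()
        doomed = []
        for reservation in ec2.get_all_instances():
            names = [getattr(g, "name", None) or getattr(g, "id", None)
                     for g in reservation.groups]
            if CLUSTER_GROUP in names:
                doomed.extend(i.id for i in reservation.instances
                              if i.state == "running")
        if doomed:
            ec2.terminate_instances(doomed)

    if __name__ == "__main__":
        print "workflow finished:", wait_for_completion()
        shutdown_cluster()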
Re: Auto-shutdown for EC2 clusters
We're doing the same thing, but doing the scheduling just with shell scripts running on a machine outside of the Hadoop cluster. It works, but we're getting into a bit of scripting hell as things get more complex.

We're using distcp to first copy the files the jobs need from S3 to HDFS, and it works nicely. When going the other direction we have to pull the data down from HDFS to one of the EC2 machines and then push it back up to S3. If I understand things right, the support for that will be better in 0.19.

/ Per

On Thu, Oct 23, 2008 at 8:52 AM, Chris K Wensel <[EMAIL PROTECTED]> wrote:
> Hey Stuart
>
> I did that for a client using Cascading events and SQS.
>
> When jobs completed, they dropped a message on SQS where a listener picked up new jobs and ran with them, or decided to kill off the cluster. The currently shipping EC2 scripts are suitable for having multiple simultaneous clusters for this purpose.
>
> Cascading has always and now Hadoop supports (thanks Tom) raw file access on S3, so this is quite natural. This is the best approach as data is pulled directly into the Mapper, instead of onto HDFS first, then read into the Mapper from HDFS.
>
> YMMV
>
> chris
>
> On Oct 23, 2008, at 7:47 AM, Stuart Sierra wrote:
>
>> Hi folks,
>> Anybody tried scripting Hadoop on EC2 to...
>> 1. Launch a cluster
>> 2. Pull data from S3
>> 3. Run a job
>> 4. Copy results to S3
>> 5. Terminate the cluster
>> ... without any user interaction?
>>
>> -Stuart
>
> --
> Chris K Wensel
> [EMAIL PROTECTED]
> http://chris.wensel.net/
> http://www.cascading.org/
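For the S3/HDFS staging Per describes, the distcp invocations look roughly like the sketch below, wrapped in Python so they can slot into a controller script. Bucket names and paths are placeholders, and it assumes fs.s3n.awsAccessKeyId / fs.s3n.awsSecretAccessKey are set in the cluster's hadoop-site.xml so credentials don't appear on the command line.

    import subprocess

    def distcp(src, dest):
        """Run `hadoop distcp` and fail loudly if the copy does not succeed."""
        rc = subprocess.call(["hadoop", "distcp", src, dest])
        if rc != 0:
            raise RuntimeError("distcp failed: %s -> %s" % (src, dest))

    # Stage job input from S3 (native s3n filesystem, new in 0.18) into HDFS.
    distcp("s3n://my-bucket/input/", "hdfs:///user/hadoop/input/")    # placeholder paths

    # In principle distcp can also copy back out to s3n://; Per's note above
    # suggests that direction was still rough before 0.19, so treat this as
    # optimistic rather than tested.
    distcp("hdfs:///user/hadoop/output/", "s3n://my-bucket/output/")  # placeholder paths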
Re: Auto-shutdown for EC2 clusters
Hey Stuart

I did that for a client using Cascading events and SQS.

When jobs completed, they dropped a message on SQS where a listener picked up new jobs and ran with them, or decided to kill off the cluster. The currently shipping EC2 scripts are suitable for having multiple simultaneous clusters for this purpose.

Cascading has always supported, and now Hadoop supports (thanks Tom), raw file access on S3, so this is quite natural. This is the best approach as data is pulled directly into the Mapper, instead of onto HDFS first, then read into the Mapper from HDFS.

YMMV

chris

On Oct 23, 2008, at 7:47 AM, Stuart Sierra wrote:
> Hi folks,
> Anybody tried scripting Hadoop on EC2 to...
> 1. Launch a cluster
> 2. Pull data from S3
> 3. Run a job
> 4. Copy results to S3
> 5. Terminate the cluster
> ... without any user interaction?
>
> -Stuart

--
Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/
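To illustrate the "pulled directly into the Mapper" point: a job can simply be given s3n:// URIs as its input and output paths instead of staging through HDFS. A hedged sketch using Hadoop Streaming driven from Python; the streaming jar location, bucket names, and mapper/reducer scripts are placeholders, and AWS credentials are assumed to be configured via fs.s3n.awsAccessKeyId / fs.s3n.awsSecretAccessKey.

    import subprocess

    STREAMING_JAR = "/usr/local/hadoop/contrib/streaming/hadoop-streaming.jar"  # placeholder path

    # The mappers read their splits straight from S3 via the s3n:// filesystem,
    # so no distcp into HDFS is needed first.
    subprocess.check_call([
        "hadoop", "jar", STREAMING_JAR,
        "-input",   "s3n://my-bucket/input/",         # placeholder
        "-output",  "s3n://my-bucket/output/run-1/",  # placeholder
        "-mapper",  "my_mapper.py",                   # placeholder
        "-reducer", "my_reducer.py",                  # placeholder
        "-file",    "my_mapper.py",
        "-file",    "my_reducer.py",
    ])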
Auto-shutdown for EC2 clusters
Hi folks,

Anybody tried scripting Hadoop on EC2 to...
1. Launch a cluster
2. Pull data from S3
3. Run a job
4. Copy results to S3
5. Terminate the cluster
... without any user interaction?

-Stuart
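Pulling the thread together, a skeleton of those five steps as a single unattended Python driver. It leans on the contrib/ec2 scripts Chris mentions; the `hadoop-ec2 launch-cluster` / `terminate-cluster` subcommands are as I recall them from the wiki of the time, so check bin/hadoop-ec2 in your release for the exact names, and every name, host, and path below is a placeholder rather than a tested recipe.

    import subprocess

    CLUSTER = "adhoc-cluster-01"   # placeholder cluster/security-group name
    SLAVES = 10                    # placeholder cluster size

    def run(cmd):
        """Run a command, raising if it fails so cron mails us the traceback."""
        print "+", " ".join(cmd)
        subprocess.check_call(cmd)

    # 1. Launch a cluster (contrib/ec2 scripts; subcommand names may differ by release).
    run(["hadoop-ec2", "launch-cluster", CLUSTER, str(SLAVES)])

    # 2-4. Stage input from S3, run the job, and push results back to S3.
    #      A real script would first wait for the cluster to accept jobs (see the
    #      readiness check sketched earlier in the thread) and would discover the
    #      master hostname from the EC2 API rather than hard-coding it.
    MASTER_HOST = "ec2-xx-xx-xx-xx.compute-1.amazonaws.com"  # placeholder
    run(["ssh", "root@" + MASTER_HOST,
         "hadoop distcp s3n://my-bucket/input/ hdfs:///input/ && "
         "hadoop jar my-job.jar /input /output && "
         "hadoop distcp hdfs:///output/ s3n://my-bucket/output/"])

    # 5. Terminate the cluster once the results are safely off-cluster. The stock
    #    script may prompt for confirmation, which would need handling for a fully
    #    unattended run.
    run(["hadoop-ec2", "terminate-cluster", CLUSTER])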