Re: Hadoop & EC2
Hi Tom,

This clears up my questions. Thanks!

Ryan

On Thu, Sep 4, 2008 at 9:21 AM, Tom White <[EMAIL PROTECTED]> wrote:
> That's because S3FileSystem stores files as 64MB blocks on S3.
> [...]
> Yes, HDFS stores files as 64MB blocks too, and map input is split by
> default so each map processes one block.
> [...]
> You could try increasing the HDFS block size. 128MB is actually
> usually a better value, for this very reason.
> [...]
> I was running 20 nodes, and each map task was handling an HDFS block, 64MB.
>
> Hope this helps,
>
> Tom
Re: Hadoop & EC2
On Thu, Sep 4, 2008 at 1:46 PM, Ryan LeCompte <[EMAIL PROTECTED]> wrote:
> I'm noticing that using bin/hadoop fs -put ... s3://... is uploading
> multi-gigabyte files in ~64MB chunks.

That's because S3FileSystem stores files as 64MB blocks on S3.

> Then this is copied from S3 into HDFS using bin/hadoop distcp. Once
> the files are there and the job begins, it looks like it's breaking
> up the 4 multi-gigabyte text files into about 225 maps. Does this
> mean that each map is roughly processing 64MB of data?

Yes, HDFS stores files as 64MB blocks too, and map input is split by
default so each map processes one block.

> If so, is there any way to change this so that I can get my map tasks
> to process more data at a time? I'm curious if this will shorten the
> time it takes to run the program.

You could try increasing the HDFS block size. 128MB is actually
usually a better value, for this very reason.

In the future https://issues.apache.org/jira/browse/HADOOP-2560 will
help here too.

> Tom, in your article about Hadoop + EC2 you mention processing about
> 100GB of logs in under 6 minutes or so.

In this article:
http://developer.amazonwebservices.com/connect/entry.jspa?externalID=873,
it took 35 minutes to run the job. I'm planning on doing some
benchmarking on EC2 fairly soon, which should help us improve the
performance of Hadoop on EC2. It's worth remarking that this was
running on small instances. The larger instances perform a lot better
in my experience.

> Do you remember how many EC2 instances you had running, and also how
> many map tasks did you have to operate on the 100GB? Was each map
> task handling about 1GB each?

I was running 20 nodes, and each map task was handling an HDFS block, 64MB.

Hope this helps,

Tom
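The block/split arithmetic Tom describes can be sketched with some rough
numbers. The file sizes below are illustrative assumptions (not Ryan's
actual data), chosen so that four multi-gigabyte files yield about 225
maps at the default 64MB block size; in Hadoop of this era the block
size was typically configured via the dfs.block.size property in
hadoop-site.xml.

```python
# Rough sketch: with the default input splitting, each map task
# processes one HDFS block, so the number of maps is roughly the sum of
# ceil(file_size / block_size) over the input files. File sizes here
# are made-up examples.
import math

MB = 1024 * 1024
GB = 1024 * MB

def map_count(file_sizes, block_size):
    """Approximate default number of map tasks (one per block)."""
    return sum(math.ceil(size / block_size) for size in file_sizes)

files = [3.5 * GB, 3.5 * GB, 3.5 * GB, 3.56 * GB]  # four assumed multi-GB files

print(map_count(files, 64 * MB))    # 225 maps with 64MB blocks
print(map_count(files, 128 * MB))   # 113 maps with 128MB blocks
```

Doubling the block size roughly halves the map count, which is why Tom
suggests 128MB when per-map overhead dominates.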
Re: Hadoop & EC2
I'm noticing that using bin/hadoop fs -put ... s3://... is uploading
multi-gigabyte files in ~64MB chunks. Then this is copied from S3 into
HDFS using bin/hadoop distcp. Once the files are there and the job
begins, it looks like it's breaking up the 4 multi-gigabyte text files
into about 225 maps. Does this mean that each map is roughly processing
64MB of data?

If so, is there any way to change this so that I can get my map tasks
to process more data at a time? I'm curious if this will shorten the
time it takes to run the program.

Tom, in your article about Hadoop + EC2 you mention processing about
100GB of logs in under 6 minutes or so. Do you remember how many EC2
instances you had running, and also how many map tasks did you have to
operate on the 100GB? Was each map task handling about 1GB each?

Thanks,
Ryan

On Wed, Sep 3, 2008 at 11:21 AM, Tom White <[EMAIL PROTECTED]> wrote:
> [...]
Re: Hadoop & EC2
On Wed, Sep 3, 2008 at 3:05 PM, Ryan LeCompte <[EMAIL PROTECTED]> wrote:
> Tom,
>
> I noticed that you mentioned using Amazon's new elastic block store as
> an alternative to using S3. Right now I'm testing pushing data to S3,
> then moving it from S3 into HDFS once the Hadoop cluster is up and
> running in EC2. It works pretty well -- moving data from S3 to HDFS is
> fast when the data in S3 is broken up into multiple files, since
> bin/hadoop distcp uses a Map/Reduce job to efficiently transfer the
> data.

Yes, this is a good-enough solution for many applications.

> Are there any real advantages to using the new elastic block store? Is
> moving data from the elastic block store into HDFS any faster than
> doing it from S3? Or can HDFS essentially live inside of the elastic
> block store?

Bandwidth between EBS and EC2 is better than between S3 and EC2, so if
you intend to run MapReduce on your data then you might consider
running an elastic Hadoop cluster that stores data on EBS-backed HDFS.
The nice thing is that you can shut down the cluster when you're not
using it and then restart it later. But if you have other applications
that need to access data from S3, then this may not be appropriate.
Also, it may not be as fast as HDFS using local disks for storage.

This is a new area, and I haven't done any measurements, so a lot of
this is conjecture on my part. Hadoop on EBS doesn't exist yet - but
it looks like a natural fit.

> Thanks!
>
> Ryan
>
> [...]
Re: Hadoop & EC2
Will do Tom... I am about to go on vacation for 3 weeks, so don't
expect anything super soon. It is nothing to get excited about but is
enough to get people into the concepts and thinking of MR and running
quickly in the IDE.

Cheers,
Tim

On Wed, Sep 3, 2008 at 3:54 PM, Tom White <[EMAIL PROTECTED]> wrote:
> There's a case study with some numbers in it from a presentation I
> gave on Hadoop and AWS in London last month, which you may find
> interesting: http://skillsmatter.com/custom/presentations/ec2-talk.pdf.
> [...]
> This sounds very useful. Please consider creating a Jira and
> submitting the code (even if it's not "finished" folks might like to
> see it). Thanks.
>
> Tom
Re: Hadoop & EC2
Tom,

I noticed that you mentioned using Amazon's new elastic block store as
an alternative to using S3. Right now I'm testing pushing data to S3,
then moving it from S3 into HDFS once the Hadoop cluster is up and
running in EC2. It works pretty well -- moving data from S3 to HDFS is
fast when the data in S3 is broken up into multiple files, since
bin/hadoop distcp uses a Map/Reduce job to efficiently transfer the
data.

Are there any real advantages to using the new elastic block store? Is
moving data from the elastic block store into HDFS any faster than
doing it from S3? Or can HDFS essentially live inside of the elastic
block store?

Thanks!
Ryan

On Wed, Sep 3, 2008 at 9:54 AM, Tom White <[EMAIL PROTECTED]> wrote:
> [...]
Re: Hadoop & EC2
There's a case study with some numbers in it from a presentation I
gave on Hadoop and AWS in London last month, which you may find
interesting: http://skillsmatter.com/custom/presentations/ec2-talk.pdf.

tim robertson <[EMAIL PROTECTED]> wrote:
> For these small datasets, you might find it useful - let me know if I
> should spend time finishing it (or submit help?) - it is really very
> simple.

This sounds very useful. Please consider creating a Jira and
submitting the code (even if it's not "finished" folks might like to
see it). Thanks.

Tom

> [...]
Re: Hadoop & EC2
I assume that Karl means 'regions' - i.e. Europe or US. I don't think
S3 has the same premise of availability zones that EC2 has.

Between different regions, data transfer is 1) charged for and 2)
likely slower between EC2 and S3-Europe. Transfer between S3-US and
EC2 is free of charge, and should be significantly quicker.

Russell

Ryan LeCompte wrote:
> How can you ensure that the S3 buckets and EC2 instances belong to a
> certain zone?
>
> Ryan
>
> On Tue, Sep 2, 2008 at 2:38 PM, Karl Anderson <[EMAIL PROTECTED]> wrote:
>> [...]
>> Make sure your S3 buckets and EC2 instances are in the same zone.
Re: Hadoop & EC2
How can you ensure that the S3 buckets and EC2 instances belong to a
certain zone?

Ryan

On Tue, Sep 2, 2008 at 2:38 PM, Karl Anderson <[EMAIL PROTECTED]> wrote:
> I'm seeing much faster speeds. With 128 nodes running a mapper-only
> downloading job, downloading 30 GB takes roughly a minute, less time
> than the end-of-job work (which I assume is HDFS replication and
> bookkeeping). More mappers gives you more parallel downloads, of
> course. I'm using a Python REST client for S3, and only move data to
> or from S3 when Hadoop is done with it.
>
> Make sure your S3 buckets and EC2 instances are in the same zone.
Re: Hadoop & EC2
On 2-Sep-08, at 5:22 AM, Ryan LeCompte wrote:
> I'm noticing that it takes about 16 minutes to transfer about 15GB of
> textual uncompressed data from S3 into HDFS after the cluster has
> started with 15 nodes. [...] I am also noticing that it takes about
> 15 minutes to parse through the 15GB of data with a 15 node cluster.

I'm seeing much faster speeds. With 128 nodes running a mapper-only
downloading job, downloading 30 GB takes roughly a minute, less time
than the end-of-job work (which I assume is HDFS replication and
bookkeeping). More mappers gives you more parallel downloads, of
course. I'm using a Python REST client for S3, and only move data to
or from S3 when Hadoop is done with it.

Make sure your S3 buckets and EC2 instances are in the same zone.
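Some back-of-the-envelope arithmetic on the two transfer rates reported
in this thread. The totals (15GB in ~16 minutes on 15 nodes; 30GB in
~1 minute on 128 nodes) come from the emails; the aggregate and
per-node figures are my derivation, and only rough since the reported
times are approximate.

```python
# Compare the two S3 transfer rates mentioned in the thread.
def mb_per_s(gigabytes, seconds):
    """Aggregate throughput in MB/s for a transfer of the given size."""
    return gigabytes * 1024 / seconds

ryan_total = mb_per_s(15, 16 * 60)   # Ryan: 15GB in ~16 minutes, 15 nodes
karl_total = mb_per_s(30, 60)        # Karl: 30GB in ~1 minute, 128 nodes

print(f"Ryan: {ryan_total:.0f} MB/s aggregate, {ryan_total / 15:.1f} MB/s per node")
print(f"Karl: {karl_total:.0f} MB/s aggregate, {karl_total / 128:.1f} MB/s per node")
```

Roughly 1 MB/s per node versus 4 MB/s per node, so Karl's setup is
getting several times Ryan's per-node S3 bandwidth, not just more nodes.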
Re: Hadoop & EC2
Tom White's blog has a nice piece on the different setups you can have
for a Hadoop cluster on EC2:
http://www.lexemetech.com/2008/08/elastic-hadoop-clusters-with-amazons.html

With the EBS volumes you can bring up and take down your cluster at
will, so you don't need to have 20 machines running all the time. We're
still collecting performance numbers, but it's definitely faster to use
EBS or local storage on EC2 than it is to use S3 (we were seeing
2Mb/s - 10Mb/s).

M

On Tue, Sep 2, 2008 at 8:59 AM, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> tim robertson wrote:
>> Incidentally, I have most of the basics of a "MapReduce-Lite" which I
>> aim to port to use the exact Hadoop API since I am *only* working on
>> 10's-100's GB of data and find that it is running really fine on my
>> laptop and I don't need the distributed failover.
>
> If it's going to be API-compatible with regular Hadoop, then I'm sure
> many people will find it useful. E.g. many Nutch users bemoan the
> complexity of distributed Hadoop setup, and they are not satisfied
> with the "local" single-threaded physical-copy execution mode.
Re: Hadoop & EC2
tim robertson wrote:
> Incidentally, I have most of the basics of a "MapReduce-Lite" which I
> aim to port to use the exact Hadoop API since I am *only* working on
> 10's-100's GB of data and find that it is running really fine on my
> laptop and I don't need the distributed failover.

If it's going to be API-compatible with regular Hadoop, then I'm sure
many people will find it useful. E.g. many Nutch users bemoan the
complexity of distributed Hadoop setup, and they are not satisfied with
the "local" single-threaded physical-copy execution mode.

--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web; Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
Re: Hadoop & EC2
Hi Tim,

Thanks for responding -- I believe that I'll need the full power of
Hadoop since I'll want this to scale well beyond 100GB of data. Thanks
for sharing your experiences -- I'll definitely check out your blog.

Thanks!
Ryan

On Tue, Sep 2, 2008 at 8:47 AM, tim robertson <[EMAIL PROTECTED]> wrote:
> [...]
Re: Hadoop & EC2
Hi Ryan,

I actually blogged my experience, as it was my first usage of EC2:
http://biodivertido.blogspot.com/2008/06/hadoop-on-amazon-ec2-to-generate.html

My input data was not log files but a dump of 150 million records from
MySQL into about 13 columns of tab-delimited data, I believe. It was a
couple of months ago, but I remember thinking S3 was very slow...

I ran some simple operations like distinct values of one column based
on another (species within a cell) and also did some polygon analysis,
since "is this point in this polygon" does not really scale too well
in PostGIS.

Incidentally, I have most of the basics of a "MapReduce-Lite" which I
aim to port to the exact Hadoop API, since I am *only* working on
10's-100's GB of data and find that it runs really fine on my laptop,
and I don't need the distributed failover. My goal for that code is
for people like me who want to know they can scale to terabyte
processing, but don't need to take the plunge into a full Hadoop
deployment yet, knowing they can migrate the processing in the future
as things grow. It runs on the normal filesystem, single node only
(i.e. multithreaded), and performs very quickly since it is just doing
Java NIO ByteBuffers in parallel on the underlying filesystem - on my
laptop I Map+Sort+Combine about 130,000 jobs a second (the simplest of
simple map operations). For these small datasets, you might find it
useful - let me know if I should spend time finishing it (or submit
help?) - it is really very simple.

Cheers,
Tim

On Tue, Sep 2, 2008 at 2:22 PM, Ryan LeCompte <[EMAIL PROTECTED]> wrote:
> Hi Tim,
>
> Are you mostly just processing/parsing textual log files? How many
> maps/reduces did you configure in your hadoop-ec2-env.sh file? How
> many did you configure in your JobConf? Just trying to get an idea of
> what to expect in terms of performance. I'm noticing that it takes
> about 16 minutes to transfer about 15GB of textual uncompressed data
> from S3 into HDFS after the cluster has started with 15 nodes. I was
> expecting this to take a shorter amount of time, but maybe I'm
> incorrect in my assumptions. I am also noticing that it takes about 15
> minutes to parse through the 15GB of data with a 15 node cluster.
>
> Thanks,
> Ryan
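Tim's "distinct values of one column based on another (species within a cell)" operation maps naturally onto a tiny single-node sketch in plain Java. This is illustrative only, not his MapReduce-Lite code; the class name, column indices, and sample data are all made up:

```java
import java.util.*;
import java.util.stream.*;

public class DistinctPerCell {
    // Count the distinct values of one tab-separated column (valCol)
    // per value of another column (keyCol), e.g. species per cell.
    public static Map<String, Long> distinctPerKey(Stream<String> lines,
                                                   int keyCol, int valCol) {
        return lines.map(l -> l.split("\t", -1))
                    // skip malformed rows that lack the needed columns
                    .filter(f -> f.length > Math.max(keyCol, valCol))
                    .map(f -> Map.entry(f[keyCol], f[valCol]))
                    .distinct()                       // unique (key, value) pairs
                    .collect(Collectors.groupingBy(Map.Entry::getKey,
                                                   Collectors.counting()));
    }

    public static void main(String[] args) {
        Stream<String> sample = Stream.of(
            "cell1\tspeciesA",
            "cell1\tspeciesA",   // duplicate observation, counted once
            "cell1\tspeciesB",
            "cell2\tspeciesA");
        // Two distinct species in cell1, one in cell2.
        System.out.println(distinctPerKey(sample, 0, 1));
    }
}
```

The same map (emit key/value), sort, and reduce (count distinct) shape is what the full Hadoop job would do, just spread across nodes.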
Re: Hadoop & EC2
Hi Tim,

Are you mostly just processing/parsing textual log files? How many
maps/reduces did you configure in your hadoop-ec2-env.sh file? How
many did you configure in your JobConf? Just trying to get an idea of
what to expect in terms of performance.

I'm noticing that it takes about 16 minutes to transfer about 15GB of
textual uncompressed data from S3 into HDFS after the cluster has
started with 15 nodes. I was expecting this to take a shorter amount
of time, but maybe I'm incorrect in my assumptions. I am also noticing
that it takes about 15 minutes to parse through the 15GB of data with
a 15 node cluster.

Thanks,
Ryan

On Tue, Sep 2, 2008 at 3:29 AM, tim robertson <[EMAIL PROTECTED]> wrote:
> I have been processing only 100s GBs on EC2, not 1000's and using 20
> nodes and really only in exploration and testing phase right now.
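For reference, the task counts Ryan asks about are usually set in the Hadoop configuration rather than in the EC2 scripts themselves. A sketch of a hadoop-site.xml fragment using the 0.18-era property names; the values are made-up examples, not recommendations:

```xml
<!-- Illustrative fragment only; property names are the 0.18-era ones
     and every value here is an assumed example. -->
<configuration>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value>          <!-- concurrent map slots per node -->
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>15</value>         <!-- reduces for the whole job -->
  </property>
  <property>
    <name>dfs.block.size</name>
    <value>134217728</value>  <!-- 128MB blocks, so fewer, larger maps -->
  </property>
</configuration>
```

The dfs.block.size entry ties into the block-size discussion above: with 128MB blocks, the same input produces roughly half as many map tasks.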
Re: Hadoop & EC2
I have been processing only 100s of GBs on EC2, not 1000's, using 20
nodes, and I'm really only in the exploration and testing phase right
now.

On Tue, Sep 2, 2008 at 8:44 AM, Andrew Hitchcock <[EMAIL PROTECTED]> wrote:
> Hi Ryan,
>
> Just a heads up, if you require more than the 20 node limit, Amazon
> provides a form to request a higher limit:
>
> http://www.amazon.com/gp/html-forms-controller/ec2-request
>
> Andrew
Re: Hadoop & EC2
Hi Ryan,

Just a heads up: if you require more than the 20 node limit, Amazon
provides a form to request a higher limit:

http://www.amazon.com/gp/html-forms-controller/ec2-request

Andrew

On Mon, Sep 1, 2008 at 10:43 PM, Ryan LeCompte <[EMAIL PROTECTED]> wrote:
> Hello all,
>
> I'm curious to see how many people are using EC2 to execute their
> Hadoop cluster and map/reduce programs, and how many are using
> home-grown datacenters. It seems like the 20 node limit with EC2 is a
> bit crippling when one wants to process many gigabytes of data. Has
> anyone found this to be the case? How much data are people processing
> with their 20 node limit on EC2? Curious what the thoughts are...
>
> Thanks,
> Ryan
Hadoop & EC2
Hello all,

I'm curious to see how many people are using EC2 to execute their
Hadoop cluster and map/reduce programs, and how many are using
home-grown datacenters. It seems like the 20 node limit with EC2 is a
bit crippling when one wants to process many gigabytes of data. Has
anyone found this to be the case? How much data are people processing
with their 20 node limit on EC2? Curious what the thoughts are...

Thanks,
Ryan
hadoop-ec2 log access
I'm unable to access my logs with the JobTracker/TaskTracker web
interface for a Hadoop job running on Amazon EC2. The URLs given for
the task logs are of the form:

http://domu-[...].compute-1.internal:50060/

The Hadoop-EC2 docs suggest that I should be able to get onto port
50060 on the master and the task boxes. Is there a way to reach the
logs, maybe by finding out what IP address to use? Or is there a way
to see the logs on the master? When I run pseudo-distributed, the logs
show up in the logs/userlogs subdirectory of the Hadoop root, but I
can't find them on my EC2 instances. I'm running a streaming job, so I
need to be able to look at the stderr of my tasks.

Thanks for any help.
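A couple of workarounds commonly used with the EC2 scripts of this era are sketched below. These commands are hedged assumptions, not from the thread: the security group name, keypair path, and log directory are guesses about a typical setup, and the `<...>` hostnames are placeholders you must fill in.

```shell
# Option 1 (assumed group name): open the web UI ports in the EC2
# security group the cluster runs in, then browse to the nodes'
# public hostnames instead of the *.compute-1.internal ones.
ec2-authorize my-hadoop-group -p 50030   # JobTracker web UI
ec2-authorize my-hadoop-group -p 50060   # TaskTracker web UI

# Option 2: tunnel a SOCKS proxy through the master and point the
# browser's SOCKS setting at localhost:6666, so that the internal
# compute-1.internal hostnames in the links resolve.
ssh -i ~/.ssh/id_rsa-gsg-keypair -D 6666 root@<master-public-dns>

# Option 3: read the task logs directly on a tasktracker node
# (the log path is an assumption about the AMI's layout).
ssh -i ~/.ssh/id_rsa-gsg-keypair root@<node-public-dns> \
    'ls /mnt/hadoop/logs/userlogs'
```

For a streaming job, each task's stderr ends up under that per-task userlogs directory on whichever node ran the task.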