Hi Tom,

This clears up my questions.
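
In case it's useful to anyone else on the list, this is roughly what I'm
planning to try next, based on Tom's block-size suggestion. The bucket
name, paths, and namenode address below are placeholders, and I'm
assuming distcp accepts the -D generic option in our version; if not,
dfs.block.size can be set in hadoop-site.xml instead:

  # Copy from S3 into HDFS, asking for 128MB blocks (134217728 bytes)
  # instead of the default 64MB.
  bin/hadoop distcp -D dfs.block.size=134217728 \
      s3://mybucket/logs hdfs://namenode:9000/logs

  # Sanity check: list the copied files and how many blocks each one got.
  bin/hadoop fsck /logs -files -blocks

If that works, the same ~14GB of input (225 maps x 64MB) should come out
to roughly half as many map tasks.
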
Thanks!

Ryan

On Thu, Sep 4, 2008 at 9:21 AM, Tom White <[EMAIL PROTECTED]> wrote:
> On Thu, Sep 4, 2008 at 1:46 PM, Ryan LeCompte <[EMAIL PROTECTED]> wrote:
>> I'm noticing that using bin/hadoop fs -put ... s3://... is uploading
>> multi-gigabyte files in ~64MB chunks.
>
> That's because S3FileSystem stores files as 64MB blocks on S3.
>
>> Then this data is copied from S3 into HDFS using bin/hadoop distcp.
>> Once the files are there and the job begins, it looks like it's
>> breaking up the 4 multi-gigabyte text files into about 225 maps. Does
>> this mean that each map is processing roughly 64MB of data?
>
> Yes, HDFS stores files as 64MB blocks too, and map input is split by
> default so that each map processes one block.
>
>> If so, is there any way to change this so that I can get my map tasks
>> to process more data at a time? I'm curious whether this will shorten
>> the time it takes to run the program.
>
> You could try increasing the HDFS block size. 128MB is usually a
> better value, for this very reason.
>
> In the future, https://issues.apache.org/jira/browse/HADOOP-2560 will
> help here too.
>
>> Tom, in your article about Hadoop + EC2 you mention processing about
>> 100GB of logs in under 6 minutes or so.
>
> In this article:
> http://developer.amazonwebservices.com/connect/entry.jspa?externalID=873,
> it took 35 minutes to run the job. I'm planning on doing some
> benchmarking on EC2 fairly soon, which should help us improve the
> performance of Hadoop on EC2. It's worth noting that this was running
> on small instances; the larger instances perform a lot better in my
> experience.
>
>> Do you remember how many EC2 instances you had running, and how many
>> map tasks it took to process the 100GB? Was each map task handling
>> about 1GB?
>
> I was running 20 nodes, and each map task was handling one HDFS block,
> i.e. 64MB.
>
> Hope this helps,
>
> Tom
