Also: http://www.slideshare.net/cloudera/hw09-optimizing-hadoop-deployments
On Tue, Apr 13, 2010 at 12:58 PM, alex kamil <alex.ka...@gmail.com> wrote:

Andrew,

Here are some tips for Hadoop runtime config:
http://cloudepr.blogspot.com/2009/09/cluster-facilities-hardware-and.html

Also, here are some results from my cluster (1 GbE NICs, fiber; Dell 5500, 24 GB RAM, 8-core, 16 with hyperthreading; JBOD). I saw slightly better numbers on a different 4-node cluster with HP G5s.

----- TestDFSIO ----- : write
           Date & time: Wed Mar 31 02:28:59 EDT 2010
       Number of files: 10
Total MBytes processed: 10000
     Throughput mb/sec: 5.615639781416837
Average IO rate mb/sec: 5.631219863891602
 IO rate std deviation: 0.2928237500022612
    Test exec time sec: 219.095

----- TestDFSIO ----- : read
           Date & time: Wed Mar 31 02:32:21 EDT 2010
       Number of files: 10
Total MBytes processed: 10000
     Throughput mb/sec: 10.662958800459787
Average IO rate mb/sec: 13.391314506530762
 IO rate std deviation: 8.181072283752508
    Test exec time sec: 157.752

Thanks
Alex

On Mon, Apr 12, 2010 at 4:19 PM, Andrew Nguyen <and...@ucsfcti.org> wrote:

Correction: they are 100 Mbps NICs. iperf shows that we're getting about 95 Mbits/sec from one node to another.

On Apr 12, 2010, at 1:05 PM, Andrew Nguyen wrote:

@Todd: I do need the sorting behavior, eventually. However, I'll try it with zero reduce tasks to see.

@Alex: Yes, I was planning on incrementally building my mapper and reducer functions, so currently the mapper takes the value, multiplies it by the gain, adds the offset, and outputs a new key/value pair.
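For reference, a minimal sketch of such a streaming mapper, assuming tab-separated <sample number, sample value> input lines; the gain/offset constants, HDFS paths, and jar location below are illustrative placeholders, not Andrew's actual setup:

    # convert.awk: scale each raw sample into physical units
    # (gain and offset are made-up example values)
    BEGIN { FS = OFS = "\t"; gain = 2.5; offset = 100.0 }
    { print $1, $2 * gain + offset }

    # Run it with an identity reducer (cat), shipping the script to the task nodes:
    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*streaming*.jar \
        -input  /user/andrew/samples \
        -output /user/andrew/samples-converted \
        -mapper 'awk -f convert.awk' \
        -file   convert.awk \
        -reducer /bin/cat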
I started to run the tests but didn't know how long they should take with the parameters you listed below, and it seemed like no progress was being made. I ran again with increasing parameter values; results are included below.

Here is a run with nrFiles 1 and fileSize 10:

had...@cluster-1:/usr/lib/hadoop$ hadoop jar hadoop-0.20.2+228-test.jar TestDFSIO -write -nrFiles 1 -fileSize 10
TestFDSIO.0.0.4
10/04/12 11:57:18 INFO mapred.FileInputFormat: nrFiles = 1
10/04/12 11:57:18 INFO mapred.FileInputFormat: fileSize (MB) = 10
10/04/12 11:57:18 INFO mapred.FileInputFormat: bufferSize = 1000000
10/04/12 11:57:18 INFO mapred.FileInputFormat: creating control file: 10 mega bytes, 1 files
10/04/12 11:57:19 INFO mapred.FileInputFormat: created control files for: 1 files
10/04/12 11:57:19 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
10/04/12 11:57:19 INFO mapred.FileInputFormat: Total input paths to process : 1
10/04/12 11:57:19 INFO mapred.JobClient: Running job: job_201004111107_0017
10/04/12 11:57:20 INFO mapred.JobClient: map 0% reduce 0%
10/04/12 11:57:27 INFO mapred.JobClient: map 100% reduce 0%
10/04/12 11:57:39 INFO mapred.JobClient: map 100% reduce 100%
10/04/12 11:57:41 INFO mapred.JobClient: Job complete: job_201004111107_0017
10/04/12 11:57:41 INFO mapred.JobClient: Counters: 18
10/04/12 11:57:41 INFO mapred.JobClient:   Job Counters
10/04/12 11:57:41 INFO mapred.JobClient:     Launched reduce tasks=1
10/04/12 11:57:41 INFO mapred.JobClient:     Launched map tasks=1
10/04/12 11:57:41 INFO mapred.JobClient:     Data-local map tasks=1
10/04/12 11:57:41 INFO mapred.JobClient:   FileSystemCounters
10/04/12 11:57:41 INFO mapred.JobClient:     FILE_BYTES_READ=98
10/04/12 11:57:41 INFO mapred.JobClient:     HDFS_BYTES_READ=113
10/04/12 11:57:41 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=228
10/04/12 11:57:41 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=10485832
10/04/12 11:57:41 INFO mapred.JobClient:   Map-Reduce Framework
10/04/12 11:57:41 INFO mapred.JobClient:     Reduce input groups=5
10/04/12 11:57:41 INFO mapred.JobClient:     Combine output records=0
10/04/12 11:57:41 INFO mapred.JobClient:     Map input records=1
10/04/12 11:57:41 INFO mapred.JobClient:     Reduce shuffle bytes=0
10/04/12 11:57:41 INFO mapred.JobClient:     Reduce output records=5
10/04/12 11:57:41 INFO mapred.JobClient:     Spilled Records=10
10/04/12 11:57:41 INFO mapred.JobClient:     Map output bytes=82
10/04/12 11:57:41 INFO mapred.JobClient:     Map input bytes=27
10/04/12 11:57:41 INFO mapred.JobClient:     Combine input records=0
10/04/12 11:57:41 INFO mapred.JobClient:     Map output records=5
10/04/12 11:57:41 INFO mapred.JobClient:     Reduce input records=5
10/04/12 11:57:41 INFO mapred.FileInputFormat: ----- TestDFSIO ----- : write
10/04/12 11:57:41 INFO mapred.FileInputFormat:            Date & time: Mon Apr 12 11:57:41 PST 2010
10/04/12 11:57:41 INFO mapred.FileInputFormat:        Number of files: 1
10/04/12 11:57:41 INFO mapred.FileInputFormat: Total MBytes processed: 10
10/04/12 11:57:41 INFO mapred.FileInputFormat:      Throughput mb/sec: 8.710801393728223
10/04/12 11:57:41 INFO mapred.FileInputFormat: Average IO rate mb/sec: 8.710801124572754
10/04/12 11:57:41 INFO mapred.FileInputFormat:  IO rate std deviation: 0.0017763302275007867
10/04/12 11:57:41 INFO mapred.FileInputFormat:     Test exec time sec: 22.757
10/04/12 11:57:41 INFO mapred.FileInputFormat:
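(Side note: TestDFSIO keeps its data under a fixed HDFS working directory, /benchmarks/TestDFSIO by default in the 0.20-era test jar, so stale output can be cleared between runs with "hadoop jar hadoop-0.20.2+228-test.jar TestDFSIO -clean".)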
Here is a run with nrFiles 10 and fileSize 100:

had...@cluster-1:/usr/lib/hadoop$ hadoop jar hadoop-0.20.2+228-test.jar TestDFSIO -write -nrFiles 10 -fileSize 100
TestFDSIO.0.0.4
10/04/12 11:58:54 INFO mapred.FileInputFormat: nrFiles = 10
10/04/12 11:58:54 INFO mapred.FileInputFormat: fileSize (MB) = 100
10/04/12 11:58:54 INFO mapred.FileInputFormat: bufferSize = 1000000
10/04/12 11:58:54 INFO mapred.FileInputFormat: creating control file: 100 mega bytes, 10 files
10/04/12 11:58:55 INFO mapred.FileInputFormat: created control files for: 10 files
10/04/12 11:58:55 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
10/04/12 11:58:55 INFO mapred.FileInputFormat: Total input paths to process : 10
10/04/12 11:58:55 INFO mapred.JobClient: Running job: job_201004111107_0018
10/04/12 11:58:56 INFO mapred.JobClient: map 0% reduce 0%
10/04/12 11:59:45 INFO mapred.JobClient: map 10% reduce 0%
10/04/12 11:59:54 INFO mapred.JobClient: map 10% reduce 3%
10/04/12 11:59:59 INFO mapred.JobClient: map 20% reduce 3%
10/04/12 12:00:01 INFO mapred.JobClient: map 40% reduce 3%
10/04/12 12:00:03 INFO mapred.JobClient: map 50% reduce 3%
10/04/12 12:00:08 INFO mapred.JobClient: map 60% reduce 3%
10/04/12 12:00:09 INFO mapred.JobClient: map 60% reduce 16%
10/04/12 12:00:11 INFO mapred.JobClient: map 70% reduce 16%
10/04/12 12:00:18 INFO mapred.JobClient: map 70% reduce 20%
10/04/12 12:00:23 INFO mapred.JobClient: map 80% reduce 20%
10/04/12 12:00:24 INFO mapred.JobClient: map 80% reduce 23%
10/04/12 12:00:26 INFO mapred.JobClient: map 90% reduce 23%
10/04/12 12:00:30 INFO mapred.JobClient: map 100% reduce 23%
10/04/12 12:00:33 INFO mapred.JobClient: map 100% reduce 26%
10/04/12 12:00:39 INFO mapred.JobClient: map 100% reduce 100%
10/04/12 12:00:41 INFO mapred.JobClient: Job complete: job_201004111107_0018
10/04/12 12:00:41 INFO mapred.JobClient: Counters: 18
10/04/12 12:00:41 INFO mapred.JobClient:   Job Counters
10/04/12 12:00:41 INFO mapred.JobClient:     Launched reduce tasks=1
10/04/12 12:00:41 INFO mapred.JobClient:     Launched map tasks=14
10/04/12 12:00:41 INFO mapred.JobClient:     Data-local map tasks=14
10/04/12 12:00:41 INFO mapred.JobClient:   FileSystemCounters
10/04/12 12:00:41 INFO mapred.JobClient:     FILE_BYTES_READ=961
10/04/12 12:00:41 INFO mapred.JobClient:     HDFS_BYTES_READ=1130
10/04/12 12:00:41 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=2296
10/04/12 12:00:41 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1048576079
10/04/12 12:00:41 INFO mapred.JobClient:   Map-Reduce Framework
10/04/12 12:00:41 INFO mapred.JobClient:     Reduce input groups=5
10/04/12 12:00:41 INFO mapred.JobClient:     Combine output records=0
10/04/12 12:00:41 INFO mapred.JobClient:     Map input records=10
10/04/12 12:00:41 INFO mapred.JobClient:     Reduce shuffle bytes=914
10/04/12 12:00:41 INFO mapred.JobClient:     Reduce output records=5
10/04/12 12:00:41 INFO mapred.JobClient:     Spilled Records=100
10/04/12 12:00:41 INFO mapred.JobClient:     Map output bytes=855
10/04/12 12:00:41 INFO mapred.JobClient:     Map input bytes=270
10/04/12 12:00:41 INFO mapred.JobClient:     Combine input records=0
10/04/12 12:00:41 INFO mapred.JobClient:     Map output records=50
10/04/12 12:00:41 INFO mapred.JobClient:     Reduce input records=50
10/04/12 12:00:41 INFO mapred.FileInputFormat: ----- TestDFSIO ----- : write
10/04/12 12:00:41 INFO mapred.FileInputFormat:            Date & time: Mon Apr 12 12:00:41 PST 2010
10/04/12 12:00:41 INFO mapred.FileInputFormat:        Number of files: 10
10/04/12 12:00:41 INFO mapred.FileInputFormat: Total MBytes processed: 1000
10/04/12 12:00:41 INFO mapred.FileInputFormat:      Throughput mb/sec: 1.9073850132944736
10/04/12 12:00:41 INFO mapred.FileInputFormat: Average IO rate mb/sec: 2.1501593589782715
10/04/12 12:00:41 INFO mapred.FileInputFormat:  IO rate std deviation: 0.8994861001170683
10/04/12 12:00:41 INFO mapred.FileInputFormat:     Test exec time sec: 106.45
10/04/12 12:00:41 INFO mapred.FileInputFormat:

The throughput is a lot lower for 10/100 vs 1/10...
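A quick sanity check on those numbers (assuming HDFS's default replication factor of 3 and the corrected 100 Mbps NICs): 10 concurrent writers at ~1.9 MB/s is ~19 MB/s of aggregate client throughput, and because each block is pipelined to two remote replicas, every byte crosses the network roughly twice. That is ~38 MB/s, or about 300 Mbit/s, shared across five links that iperf caps at ~95 Mbit/s each; the 10-file run is plausibly network-bound rather than evidence of misconfiguration.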
Here are some rough specs of our cluster. 5 identically spec'ed nodes, each has:

2 GB RAM
Pentium 4 3.0 GHz with HT
250 GB HDD on PATA
10 Mbps NIC (corrected above: they are 100 Mbps)

They are on a private network on a Dell switch.

Thanks!

--Andrew

On Apr 12, 2010, at 11:58 AM, alex kamil wrote:

Andrew,

I would also suggest running the DFSIO benchmark to isolate IO-related issues:

hadoop jar hadoop-0.20.2-test.jar TestDFSIO -write -nrFiles 10 -fileSize 1000
hadoop jar hadoop-0.20.2-test.jar TestDFSIO -read -nrFiles 10 -fileSize 1000

There are additional tests specific to MapReduce; run "hadoop jar hadoop-0.20.2-test.jar" for the complete list.

45 min for mapping 6 GB on 5 nodes is way too high, assuming your gain/offset conversion is a simple algebraic manipulation. It takes less than 5 min to run a simple mapper (using streaming) on a 4-node cluster on something like 10 GB; the mapper I used was an awk command extracting <key:value> pairs from a log (no reducer).

Thanks
Alex

On Mon, Apr 12, 2010 at 1:53 PM, Todd Lipcon <t...@cloudera.com> wrote:

Hi Andrew,

Do you need the sorting behavior that having an identity reducer gives you? If not, set the number of reduce tasks to 0 and you'll end up with a map-only job, which should be significantly faster.

-Todd

--
Todd Lipcon
Software Engineer, Cloudera
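For a streaming job, Todd's suggestion amounts to something like the sketch below (paths and jar location are illustrative, as in the earlier sketch). With zero reduces, map output is written straight to HDFS and the sort/shuffle phase is skipped entirely:

    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*streaming*.jar \
        -D mapred.reduce.tasks=0 \
        -input  /user/andrew/samples \
        -output /user/andrew/samples-converted \
        -mapper 'awk -f convert.awk' \
        -file   convert.awk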
On Mon, Apr 12, 2010 at 9:43 AM, Andrew Nguyen <andrew-lists-had...@ucsfcti.org> wrote:

Hello,

I recently set up a 5-node cluster (1 master, 4 slaves) and am looking to use it to process high volumes of patient physiologic data. As an initial exercise to gain a better understanding, I have attempted to run the following problem (which isn't the type of problem that Hadoop was really designed for, as is my understanding).

I have a 6 GB data file that contains key/value pairs of <sample number, sample value>. I'd like to convert the values, based on a gain/offset, to their physical units. I've set up a MapReduce job using streaming where the mapper does the conversion and the reducer is just an identity reducer. Consistent with other threads on the mailing list, my initial results show that it takes considerably more time to process this in Hadoop than on my MacBook Pro (45 minutes vs. 13 minutes). The input is a single 6 GB file, and it looks like the file is being split into 101 map tasks. This is consistent with the 64 MB block size.

So my questions are:

* Would it help to increase the block size to 128 MB? Or decrease it? What are some key factors to think about with this question? (See the note after this message.)
* Are there any other optimizations I could employ? I have looked into LZO compression, but I'd like to keep working without compression, since the single-threaded job I'm comparing against doesn't use any. I know I'm comparing apples to pears a little here, so please feel free to correct this assumption.
* Is Hadoop really only good for jobs where the data doesn't fit on a single node? At some level, I assume it can still speed up jobs that do fit on one node, if only because you are performing tasks in parallel.

Thanks!

--Andrew
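On the block-size question: an HDFS file's block size is fixed when the file is written, so changing it means re-uploading the input. A sketch using the 0.20-era property name dfs.block.size (the paths are illustrative):

    # Re-upload the input with a 128 MB block size (134217728 bytes).
    # The cluster-wide default can instead be set via dfs.block.size in hdfs-site.xml.
    hadoop fs -D dfs.block.size=134217728 -put samples.txt /user/andrew/samples-128m

Larger blocks mean roughly half as many map tasks (~50 instead of 101 for this file), which cuts per-task startup overhead at the cost of coarser-grained parallelism across the 4 worker nodes.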