Andrew, here are some tips for Hadoop runtime config:
http://cloudepr.blogspot.com/2009/09/cluster-facilities-hardware-and.html

Also, here are some results from my cluster (1GE NICs, fiber; Dell 5500, 24GB, 8-core (16 hypervised), JBOD). I saw slightly better numbers on a different 4-node cluster with HP G5s.

----- TestDFSIO ----- : write
           Date & time: Wed Mar 31 02:28:59 EDT 2010
       Number of files: 10
Total MBytes processed: 10000
     Throughput mb/sec: 5.615639781416837
Average IO rate mb/sec: 5.631219863891602
 IO rate std deviation: 0.2928237500022612
    Test exec time sec: 219.095

----- TestDFSIO ----- : read
           Date & time: Wed Mar 31 02:32:21 EDT 2010
       Number of files: 10
Total MBytes processed: 10000
     Throughput mb/sec: 10.662958800459787
Average IO rate mb/sec: 13.391314506530762
 IO rate std deviation: 8.181072283752508
    Test exec time sec: 157.752

Thanks
Alex

On Mon, Apr 12, 2010 at 4:19 PM, Andrew Nguyen <and...@ucsfcti.org> wrote:
> Correction, they are 100Mbps NICs...
>
> iperf shows that we're getting about 95 Mbits/sec from one node to another.
>
> On Apr 12, 2010, at 1:05 PM, Andrew Nguyen wrote:
>
> > @Todd:
> >
> > I do need the sorting behavior, eventually. However, I'll try it with zero reduce tasks to see.
> >
> > @Alex:
> >
> > Yes, I was planning on incrementally building my mapper and reducer functions, so currently the mapper takes the value, multiplies it by the gain, adds the offset, and outputs a new key/value pair.
> >
> > I started to run the tests but didn't know how long they should take with the parameters you listed below. However, it seemed like there was no progress being made.
> > Ran it with increasing parameter values; results are included below.
> >
> > Here is a run with nrFiles 1 and fileSize 10:
> >
> > had...@cluster-1:/usr/lib/hadoop$ hadoop jar hadoop-0.20.2+228-test.jar TestDFSIO -write -nrFiles 1 -fileSize 10
> > TestFDSIO.0.0.4
> > 10/04/12 11:57:18 INFO mapred.FileInputFormat: nrFiles = 1
> > 10/04/12 11:57:18 INFO mapred.FileInputFormat: fileSize (MB) = 10
> > 10/04/12 11:57:18 INFO mapred.FileInputFormat: bufferSize = 1000000
> > 10/04/12 11:57:18 INFO mapred.FileInputFormat: creating control file: 10 mega bytes, 1 files
> > 10/04/12 11:57:19 INFO mapred.FileInputFormat: created control files for: 1 files
> > 10/04/12 11:57:19 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
> > 10/04/12 11:57:19 INFO mapred.FileInputFormat: Total input paths to process : 1
> > 10/04/12 11:57:19 INFO mapred.JobClient: Running job: job_201004111107_0017
> > 10/04/12 11:57:20 INFO mapred.JobClient:  map 0% reduce 0%
> > 10/04/12 11:57:27 INFO mapred.JobClient:  map 100% reduce 0%
> > 10/04/12 11:57:39 INFO mapred.JobClient:  map 100% reduce 100%
> > 10/04/12 11:57:41 INFO mapred.JobClient: Job complete: job_201004111107_0017
> > 10/04/12 11:57:41 INFO mapred.JobClient: Counters: 18
> > 10/04/12 11:57:41 INFO mapred.JobClient:   Job Counters
> > 10/04/12 11:57:41 INFO mapred.JobClient:     Launched reduce tasks=1
> > 10/04/12 11:57:41 INFO mapred.JobClient:     Launched map tasks=1
> > 10/04/12 11:57:41 INFO mapred.JobClient:     Data-local map tasks=1
> > 10/04/12 11:57:41 INFO mapred.JobClient:   FileSystemCounters
> > 10/04/12 11:57:41 INFO mapred.JobClient:     FILE_BYTES_READ=98
> > 10/04/12 11:57:41 INFO mapred.JobClient:     HDFS_BYTES_READ=113
> > 10/04/12 11:57:41 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=228
> > 10/04/12 11:57:41 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=10485832
> > 10/04/12 11:57:41 INFO mapred.JobClient:   Map-Reduce Framework
> > 10/04/12 11:57:41 INFO mapred.JobClient:     Reduce input groups=5
> > 10/04/12 11:57:41 INFO mapred.JobClient:     Combine output records=0
> > 10/04/12 11:57:41 INFO mapred.JobClient:     Map input records=1
> > 10/04/12 11:57:41 INFO mapred.JobClient:     Reduce shuffle bytes=0
> > 10/04/12 11:57:41 INFO mapred.JobClient:     Reduce output records=5
> > 10/04/12 11:57:41 INFO mapred.JobClient:     Spilled Records=10
> > 10/04/12 11:57:41 INFO mapred.JobClient:     Map output bytes=82
> > 10/04/12 11:57:41 INFO mapred.JobClient:     Map input bytes=27
> > 10/04/12 11:57:41 INFO mapred.JobClient:     Combine input records=0
> > 10/04/12 11:57:41 INFO mapred.JobClient:     Map output records=5
> > 10/04/12 11:57:41 INFO mapred.JobClient:     Reduce input records=5
> > 10/04/12 11:57:41 INFO mapred.FileInputFormat: ----- TestDFSIO ----- : write
> > 10/04/12 11:57:41 INFO mapred.FileInputFormat:            Date & time: Mon Apr 12 11:57:41 PST 2010
> > 10/04/12 11:57:41 INFO mapred.FileInputFormat:        Number of files: 1
> > 10/04/12 11:57:41 INFO mapred.FileInputFormat: Total MBytes processed: 10
> > 10/04/12 11:57:41 INFO mapred.FileInputFormat:      Throughput mb/sec: 8.710801393728223
> > 10/04/12 11:57:41 INFO mapred.FileInputFormat: Average IO rate mb/sec: 8.710801124572754
> > 10/04/12 11:57:41 INFO mapred.FileInputFormat:  IO rate std deviation: 0.0017763302275007867
> > 10/04/12 11:57:41 INFO mapred.FileInputFormat:     Test exec time sec: 22.757
> > 10/04/12 11:57:41 INFO mapred.FileInputFormat:
> >
> > Here is a run with nrFiles 10 and fileSize 100:
> >
> > had...@cluster-1:/usr/lib/hadoop$ hadoop jar hadoop-0.20.2+228-test.jar TestDFSIO -write -nrFiles 10 -fileSize 100
> > TestFDSIO.0.0.4
> > 10/04/12 11:58:54 INFO mapred.FileInputFormat: nrFiles = 10
> > 10/04/12 11:58:54 INFO mapred.FileInputFormat: fileSize (MB) = 100
> > 10/04/12 11:58:54 INFO mapred.FileInputFormat: bufferSize = 1000000
> > 10/04/12 11:58:54 INFO mapred.FileInputFormat: creating control file: 100 mega bytes, 10 files
> > 10/04/12 11:58:55 INFO mapred.FileInputFormat: created control files for: 10 files
> > 10/04/12 11:58:55 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
> > 10/04/12 11:58:55 INFO mapred.FileInputFormat: Total input paths to process : 10
> > 10/04/12 11:58:55 INFO mapred.JobClient: Running job: job_201004111107_0018
> > 10/04/12 11:58:56 INFO mapred.JobClient:  map 0% reduce 0%
> > 10/04/12 11:59:45 INFO mapred.JobClient:  map 10% reduce 0%
> > 10/04/12 11:59:54 INFO mapred.JobClient:  map 10% reduce 3%
> > 10/04/12 11:59:59 INFO mapred.JobClient:  map 20% reduce 3%
> > 10/04/12 12:00:01 INFO mapred.JobClient:  map 40% reduce 3%
> > 10/04/12 12:00:03 INFO mapred.JobClient:  map 50% reduce 3%
> > 10/04/12 12:00:08 INFO mapred.JobClient:  map 60% reduce 3%
> > 10/04/12 12:00:09 INFO mapred.JobClient:  map 60% reduce 16%
> > 10/04/12 12:00:11 INFO mapred.JobClient:  map 70% reduce 16%
> > 10/04/12 12:00:18 INFO mapred.JobClient:  map 70% reduce 20%
> > 10/04/12 12:00:23 INFO mapred.JobClient:  map 80% reduce 20%
> > 10/04/12 12:00:24 INFO mapred.JobClient:  map 80% reduce 23%
> > 10/04/12 12:00:26 INFO mapred.JobClient:  map 90% reduce 23%
> > 10/04/12 12:00:30 INFO mapred.JobClient:  map 100% reduce 23%
> > 10/04/12 12:00:33 INFO mapred.JobClient:  map 100% reduce 26%
> > 10/04/12 12:00:39 INFO mapred.JobClient:  map 100% reduce 100%
> > 10/04/12 12:00:41 INFO mapred.JobClient: Job complete: job_201004111107_0018
> > 10/04/12 12:00:41 INFO mapred.JobClient: Counters: 18
> > 10/04/12 12:00:41 INFO mapred.JobClient:   Job Counters
> > 10/04/12 12:00:41 INFO mapred.JobClient:     Launched reduce tasks=1
> > 10/04/12 12:00:41 INFO mapred.JobClient:     Launched map tasks=14
> > 10/04/12 12:00:41 INFO mapred.JobClient:     Data-local map tasks=14
> > 10/04/12 12:00:41 INFO mapred.JobClient:   FileSystemCounters
> > 10/04/12 12:00:41 INFO mapred.JobClient:     FILE_BYTES_READ=961
> > 10/04/12 12:00:41 INFO mapred.JobClient:     HDFS_BYTES_READ=1130
> > 10/04/12 12:00:41 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=2296
> > 10/04/12 12:00:41 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1048576079
> > 10/04/12 12:00:41 INFO mapred.JobClient:   Map-Reduce Framework
> > 10/04/12 12:00:41 INFO mapred.JobClient:     Reduce input groups=5
> > 10/04/12 12:00:41 INFO mapred.JobClient:     Combine output records=0
> > 10/04/12 12:00:41 INFO mapred.JobClient:     Map input records=10
> > 10/04/12 12:00:41 INFO mapred.JobClient:     Reduce shuffle bytes=914
> > 10/04/12 12:00:41 INFO mapred.JobClient:     Reduce output records=5
> > 10/04/12 12:00:41 INFO mapred.JobClient:     Spilled Records=100
> > 10/04/12 12:00:41 INFO mapred.JobClient:     Map output bytes=855
> > 10/04/12 12:00:41 INFO mapred.JobClient:     Map input bytes=270
> > 10/04/12 12:00:41 INFO mapred.JobClient:     Combine input records=0
> > 10/04/12 12:00:41 INFO mapred.JobClient:     Map output records=50
> > 10/04/12 12:00:41 INFO mapred.JobClient:     Reduce input records=50
> > 10/04/12 12:00:41 INFO mapred.FileInputFormat: ----- TestDFSIO ----- : write
> > 10/04/12 12:00:41 INFO mapred.FileInputFormat:            Date & time: Mon Apr 12 12:00:41 PST 2010
> > 10/04/12 12:00:41 INFO mapred.FileInputFormat:        Number of files: 10
> > 10/04/12 12:00:41 INFO mapred.FileInputFormat: Total MBytes processed: 1000
> > 10/04/12 12:00:41 INFO mapred.FileInputFormat:      Throughput mb/sec: 1.9073850132944736
> > 10/04/12 12:00:41 INFO mapred.FileInputFormat: Average IO rate mb/sec: 2.1501593589782715
> > 10/04/12 12:00:41 INFO mapred.FileInputFormat:  IO rate std deviation: 0.8994861001170683
> > 10/04/12 12:00:41 INFO mapred.FileInputFormat:     Test exec time sec: 106.45
> > 10/04/12 12:00:41 INFO mapred.FileInputFormat:
> >
> > The throughput is a lot lower for 10/100 than for 1/10...
> >
> > Here are some rough specs of our cluster:
> >
> > 5 identically spec'ed nodes, each with:
> >
> > 2 GB RAM
> > Pentium 4 3.0GHz with HT
> > 250GB HDD on PATA
> > 10Mbps NIC
> >
> > They are on a private network on a Dell switch.
> >
> > Thanks!
> >
> > --Andrew
> >
> > On Apr 12, 2010, at 11:58 AM, alex kamil wrote:
> >
> >> Andrew,
> >>
> >> I would also suggest running the DFSIO benchmark to isolate IO-related issues:
> >>
> >> hadoop jar hadoop-0.20.2-test.jar TestDFSIO -write -nrFiles 10 -fileSize 1000
> >> hadoop jar hadoop-0.20.2-test.jar TestDFSIO -read -nrFiles 10 -fileSize 1000
> >>
> >> There are additional tests specific to MapReduce; run "hadoop jar hadoop-0.20.2-test.jar" for the complete list.
> >>
> >> 45 minutes for mapping 6GB on 5 nodes is way too high, assuming your gain/offset conversion is a simple algebraic manipulation.
> >>
> >> It takes less than 5 minutes to run a simple mapper (using streaming) on a 4-node cluster on something like 10GB; the mapper I used was an awk command extracting a <key:value> pair from a log (no reducer).
> >>
> >> Thanks
> >> Alex
> >>
> >> On Mon, Apr 12, 2010 at 1:53 PM, Todd Lipcon <t...@cloudera.com> wrote:
> >> Hi Andrew,
> >>
> >> Do you need the sorting behavior that having an identity reducer gives you?
> >> If not, set the number of reduce tasks to 0 and you'll end up with a map-only
> >> job, which should be significantly faster.
> >>
> >> -Todd
> >>
> >> On Mon, Apr 12, 2010 at 9:43 AM, Andrew Nguyen <andrew-lists-had...@ucsfcti.org> wrote:
> >>
> >>> Hello,
> >>>
> >>> I recently set up a 5-node cluster (1 master, 4 slaves) and am looking to use it to process high volumes of patient physiologic data. As an initial exercise to gain a better understanding, I have attempted to run the following problem (which isn't the type of problem that Hadoop was really designed for, as is my understanding).
> >>>
> >>> I have a 6G data file that contains key/value pairs of <sample number, sample value>. I'd like to convert the values based on a gain/offset to their physical units.
> >>> I've set up a MapReduce job using streaming where the mapper does the conversion and the reducer is just an identity reducer. Based on other threads on the mailing list, my initial results are consistent in that it takes considerably more time to process this in Hadoop than on my MacBook Pro (45 minutes vs. 13 minutes). The input is a single 6G file, and it looks like the file is being split into 101 map tasks. This is consistent with the 64M block size.
> >>>
> >>> So my questions are:
> >>>
> >>> * Would it help to increase the block size to 128M? Or decrease the block size? What are some key factors to think about with this question?
> >>> * Are there any other optimizations that I could employ? I have looked into LzoCompression, but I'd like to still work without compression since the single-threaded job that I'm comparing to doesn't use any sort of compression. I know I'm comparing apples to pears a little here, so please feel free to correct this assumption.
> >>> * Is Hadoop really only good for jobs where the data doesn't fit on a single node? At some level, I assume that it can still speed up jobs that do fit on one node, if only because you are performing tasks in parallel.
> >>>
> >>> Thanks!
> >>>
> >>> --Andrew
> >>
> >> --
> >> Todd Lipcon
> >> Software Engineer, Cloudera
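[Editor's note] The gain/offset streaming mapper discussed in this thread can be sketched as a minimal Hadoop Streaming script. This is an illustration only: the tab-separated record format and the GAIN/OFFSET constants are assumptions, not Andrew's actual code or calibration values.

```python
#!/usr/bin/env python
# Hadoop Streaming mapper sketch: applies a gain/offset calibration to each
# <sample number, sample value> record read from stdin and re-emits the pair.
# GAIN, OFFSET, and the tab-separated format are hypothetical placeholders.
import sys

GAIN = 0.5     # hypothetical calibration gain
OFFSET = -2.0  # hypothetical calibration offset

def convert(line, gain=GAIN, offset=OFFSET):
    """Turn one 'key<TAB>raw_value' line into 'key<TAB>physical_value'."""
    key, raw = line.rstrip("\n").split("\t", 1)
    return "%s\t%s" % (key, float(raw) * gain + offset)

if __name__ == "__main__":
    for line in sys.stdin:
        print(convert(line))
```

Following Todd's advice, a script like this would run as a map-only job by passing the mapper to the streaming jar with the number of reduce tasks set to 0 (check the exact jar name and options against the local 0.20 install).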
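[Editor's note] Two figures in the thread can be sanity-checked with simple arithmetic: a ~6 GB input at the default 64 MB block size yields on the order of 100 splits (Andrew saw 101; his file is presumably slightly over 6 GB), and the 100 Mbps NICs (per Andrew's correction above) cap any single link at 12.5 MB/s before protocol and replication overhead, which is consistent with the low TestDFSIO write throughput. A small sketch of that arithmetic:

```python
# Back-of-the-envelope checks on figures quoted in this thread.
GB = 1024 ** 3
MB = 1024 ** 2

# ~6 GB input / 64 MB blocks -> roughly 100 splits, hence ~100 map tasks.
input_bytes = 6 * GB
block_bytes = 64 * MB
num_splits = -(-input_bytes // block_bytes)  # ceiling division
print(num_splits)  # 96 for exactly 6 GB

# A 100 Mbps NIC moves at most 12.5 MB/s in one direction, and with HDFS's
# default replication of 3 each written block also crosses the network to
# two additional nodes, so measured write throughput sits well below this.
nic_mbps = 100
link_mb_per_s = nic_mbps / 8.0
print(link_mb_per_s)  # 12.5
```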