Andrew, here are some tips for Hadoop runtime config:
http://cloudepr.blogspot.com/2009/09/cluster-facilities-hardware-and.html

Also, here are some results from my cluster (1GE NICs, fiber; Dell 5500, 24GB, 8-core (16 hypervised), JBOD). I saw slightly better numbers on a different 4-node cluster with HP G5s.

----- TestDFSIO ----- : write
           Date & time: Wed Mar 31 02:28:59 EDT 2010
       Number of files: 10
Total MBytes processed: 10000
     Throughput mb/sec: 5.615639781416837
Average IO rate mb/sec: 5.631219863891602
 IO rate std deviation: 0.2928237500022612
    Test exec time sec: 219.095

----- TestDFSIO ----- : read
           Date & time: Wed Mar 31 02:32:21 EDT 2010
       Number of files: 10
Total MBytes processed: 10000
     Throughput mb/sec: 10.662958800459787
Average IO rate mb/sec: 13.391314506530762
 IO rate std deviation: 8.181072283752508
    Test exec time sec: 157.752

Thanks
Alex

On Mon, Apr 12, 2010 at 4:19 PM, Andrew Nguyen <and...@ucsfcti.org> wrote:
> Correction, they are 100Mbps NICs...
>
> iperf shows that we're getting about 95 Mbits/sec from one node to another.
>
> On Apr 12, 2010, at 1:05 PM, Andrew Nguyen wrote:
>
> > @Todd:
> >
> > I do need the sorting behavior, eventually. However, I'll try it with zero reduce tasks to see.
> >
> > @Alex:
> >
> > Yes, I was planning on incrementally building my mapper and reducer functions, so currently the mapper takes the value, multiplies it by the gain, adds the offset, and outputs a new key/value pair.
> >
> > I started to run the tests but didn't know how long they should take with the parameters you listed below. However, it seemed like there was no progress being made.
> > Ran it with increasing parameter values; results are included below.
> >
> > Here is a run with nrFiles 1 and fileSize 10:
> >
> > had...@cluster-1:/usr/lib/hadoop$ hadoop jar hadoop-0.20.2+228-test.jar TestDFSIO -write -nrFiles 1 -fileSize 10
> > TestFDSIO.0.0.4
> > 10/04/12 11:57:18 INFO mapred.FileInputFormat: nrFiles = 1
> > 10/04/12 11:57:18 INFO mapred.FileInputFormat: fileSize (MB) = 10
> > 10/04/12 11:57:18 INFO mapred.FileInputFormat: bufferSize = 1000000
> > 10/04/12 11:57:18 INFO mapred.FileInputFormat: creating control file: 10 mega bytes, 1 files
> > 10/04/12 11:57:19 INFO mapred.FileInputFormat: created control files for: 1 files
> > 10/04/12 11:57:19 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
> > 10/04/12 11:57:19 INFO mapred.FileInputFormat: Total input paths to process : 1
> > 10/04/12 11:57:19 INFO mapred.JobClient: Running job: job_201004111107_0017
> > 10/04/12 11:57:20 INFO mapred.JobClient:  map 0% reduce 0%
> > 10/04/12 11:57:27 INFO mapred.JobClient:  map 100% reduce 0%
> > 10/04/12 11:57:39 INFO mapred.JobClient:  map 100% reduce 100%
> > 10/04/12 11:57:41 INFO mapred.JobClient: Job complete: job_201004111107_0017
> > 10/04/12 11:57:41 INFO mapred.JobClient: Counters: 18
> > 10/04/12 11:57:41 INFO mapred.JobClient:   Job Counters
> > 10/04/12 11:57:41 INFO mapred.JobClient:     Launched reduce tasks=1
> > 10/04/12 11:57:41 INFO mapred.JobClient:     Launched map tasks=1
> > 10/04/12 11:57:41 INFO mapred.JobClient:     Data-local map tasks=1
> > 10/04/12 11:57:41 INFO mapred.JobClient:   FileSystemCounters
> > 10/04/12 11:57:41 INFO mapred.JobClient:     FILE_BYTES_READ=98
> > 10/04/12 11:57:41 INFO mapred.JobClient:     HDFS_BYTES_READ=113
> > 10/04/12 11:57:41 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=228
> > 10/04/12 11:57:41 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=10485832
> > 10/04/12 11:57:41 INFO mapred.JobClient:   Map-Reduce Framework
> > 10/04/12 11:57:41 INFO mapred.JobClient:     Reduce input groups=5
> > 10/04/12 11:57:41 INFO mapred.JobClient:     Combine output records=0
> > 10/04/12 11:57:41 INFO mapred.JobClient:     Map input records=1
> > 10/04/12 11:57:41 INFO mapred.JobClient:     Reduce shuffle bytes=0
> > 10/04/12 11:57:41 INFO mapred.JobClient:     Reduce output records=5
> > 10/04/12 11:57:41 INFO mapred.JobClient:     Spilled Records=10
> > 10/04/12 11:57:41 INFO mapred.JobClient:     Map output bytes=82
> > 10/04/12 11:57:41 INFO mapred.JobClient:     Map input bytes=27
> > 10/04/12 11:57:41 INFO mapred.JobClient:     Combine input records=0
> > 10/04/12 11:57:41 INFO mapred.JobClient:     Map output records=5
> > 10/04/12 11:57:41 INFO mapred.JobClient:     Reduce input records=5
> > 10/04/12 11:57:41 INFO mapred.FileInputFormat: ----- TestDFSIO ----- : write
> > 10/04/12 11:57:41 INFO mapred.FileInputFormat:            Date & time: Mon Apr 12 11:57:41 PST 2010
> > 10/04/12 11:57:41 INFO mapred.FileInputFormat:        Number of files: 1
> > 10/04/12 11:57:41 INFO mapred.FileInputFormat: Total MBytes processed: 10
> > 10/04/12 11:57:41 INFO mapred.FileInputFormat:      Throughput mb/sec: 8.710801393728223
> > 10/04/12 11:57:41 INFO mapred.FileInputFormat: Average IO rate mb/sec: 8.710801124572754
> > 10/04/12 11:57:41 INFO mapred.FileInputFormat:  IO rate std deviation: 0.0017763302275007867
> > 10/04/12 11:57:41 INFO mapred.FileInputFormat:     Test exec time sec: 22.757
> > 10/04/12 11:57:41 INFO mapred.FileInputFormat:
> >
> > Here is a run with nrFiles 10 and fileSize 100:
> >
> > had...@cluster-1:/usr/lib/hadoop$ hadoop jar hadoop-0.20.2+228-test.jar TestDFSIO -write -nrFiles 10 -fileSize 100
> > TestFDSIO.0.0.4
> > 10/04/12 11:58:54 INFO mapred.FileInputFormat: nrFiles = 10
> > 10/04/12 11:58:54 INFO mapred.FileInputFormat: fileSize (MB) = 100
> > 10/04/12 11:58:54 INFO mapred.FileInputFormat: bufferSize = 1000000
> > 10/04/12 11:58:54 INFO mapred.FileInputFormat: creating control file: 100 mega bytes, 10 files
> > 10/04/12 11:58:55 INFO mapred.FileInputFormat: created control files for: 10 files
> > 10/04/12 11:58:55 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
> > 10/04/12 11:58:55 INFO mapred.FileInputFormat: Total input paths to process : 10
> > 10/04/12 11:58:55 INFO mapred.JobClient: Running job: job_201004111107_0018
> > 10/04/12 11:58:56 INFO mapred.JobClient:  map 0% reduce 0%
> > 10/04/12 11:59:45 INFO mapred.JobClient:  map 10% reduce 0%
> > 10/04/12 11:59:54 INFO mapred.JobClient:  map 10% reduce 3%
> > 10/04/12 11:59:59 INFO mapred.JobClient:  map 20% reduce 3%
> > 10/04/12 12:00:01 INFO mapred.JobClient:  map 40% reduce 3%
> > 10/04/12 12:00:03 INFO mapred.JobClient:  map 50% reduce 3%
> > 10/04/12 12:00:08 INFO mapred.JobClient:  map 60% reduce 3%
> > 10/04/12 12:00:09 INFO mapred.JobClient:  map 60% reduce 16%
> > 10/04/12 12:00:11 INFO mapred.JobClient:  map 70% reduce 16%
> > 10/04/12 12:00:18 INFO mapred.JobClient:  map 70% reduce 20%
> > 10/04/12 12:00:23 INFO mapred.JobClient:  map 80% reduce 20%
> > 10/04/12 12:00:24 INFO mapred.JobClient:  map 80% reduce 23%
> > 10/04/12 12:00:26 INFO mapred.JobClient:  map 90% reduce 23%
> > 10/04/12 12:00:30 INFO mapred.JobClient:  map 100% reduce 23%
> > 10/04/12 12:00:33 INFO mapred.JobClient:  map 100% reduce 26%
> > 10/04/12 12:00:39 INFO mapred.JobClient:  map 100% reduce 100%
> > 10/04/12 12:00:41 INFO mapred.JobClient: Job complete: job_201004111107_0018
> > 10/04/12 12:00:41 INFO mapred.JobClient: Counters: 18
> > 10/04/12 12:00:41 INFO mapred.JobClient:   Job Counters
> > 10/04/12 12:00:41 INFO mapred.JobClient:     Launched reduce tasks=1
> > 10/04/12 12:00:41 INFO mapred.JobClient:     Launched map tasks=14
> > 10/04/12 12:00:41 INFO mapred.JobClient:     Data-local map tasks=14
> > 10/04/12 12:00:41 INFO mapred.JobClient:   FileSystemCounters
> > 10/04/12 12:00:41 INFO mapred.JobClient:     FILE_BYTES_READ=961
> > 10/04/12 12:00:41 INFO mapred.JobClient:     HDFS_BYTES_READ=1130
> > 10/04/12 12:00:41 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=2296
> > 10/04/12 12:00:41 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1048576079
> > 10/04/12 12:00:41 INFO mapred.JobClient:   Map-Reduce Framework
> > 10/04/12 12:00:41 INFO mapred.JobClient:     Reduce input groups=5
> > 10/04/12 12:00:41 INFO mapred.JobClient:     Combine output records=0
> > 10/04/12 12:00:41 INFO mapred.JobClient:     Map input records=10
> > 10/04/12 12:00:41 INFO mapred.JobClient:     Reduce shuffle bytes=914
> > 10/04/12 12:00:41 INFO mapred.JobClient:     Reduce output records=5
> > 10/04/12 12:00:41 INFO mapred.JobClient:     Spilled Records=100
> > 10/04/12 12:00:41 INFO mapred.JobClient:     Map output bytes=855
> > 10/04/12 12:00:41 INFO mapred.JobClient:     Map input bytes=270
> > 10/04/12 12:00:41 INFO mapred.JobClient:     Combine input records=0
> > 10/04/12 12:00:41 INFO mapred.JobClient:     Map output records=50
> > 10/04/12 12:00:41 INFO mapred.JobClient:     Reduce input records=50
> > 10/04/12 12:00:41 INFO mapred.FileInputFormat: ----- TestDFSIO ----- : write
> > 10/04/12 12:00:41 INFO mapred.FileInputFormat:            Date & time: Mon Apr 12 12:00:41 PST 2010
> > 10/04/12 12:00:41 INFO mapred.FileInputFormat:        Number of files: 10
> > 10/04/12 12:00:41 INFO mapred.FileInputFormat: Total MBytes processed: 1000
> > 10/04/12 12:00:41 INFO mapred.FileInputFormat:      Throughput mb/sec: 1.9073850132944736
> > 10/04/12 12:00:41 INFO mapred.FileInputFormat: Average IO rate mb/sec: 2.1501593589782715
> > 10/04/12 12:00:41 INFO mapred.FileInputFormat:  IO rate std deviation: 0.8994861001170683
> > 10/04/12 12:00:41 INFO mapred.FileInputFormat:     Test exec time sec: 106.45
> > 10/04/12 12:00:41 INFO mapred.FileInputFormat:
> >
> > The throughput is a lot lower for 10/100 than for 1/10...
> >
> > Here are some rough specs of our cluster:
> >
> > 5 identically spec'ed nodes, each with:
> >
> > 2 GB RAM
> > Pentium 4 3.0GHz with HT
> > 250GB HDD on PATA
> > 10Mbps NIC
> >
> > They are on a private network on a Dell switch.
> >
> > Thanks!
> >
> > --Andrew
> >
> > On Apr 12, 2010, at 11:58 AM, alex kamil wrote:
> >
> >> Andrew,
> >>
> >> I would also suggest running the DFSIO benchmark to isolate IO-related issues:
> >>
> >> hadoop jar hadoop-0.20.2-test.jar TestDFSIO -write -nrFiles 10 -fileSize 1000
> >> hadoop jar hadoop-0.20.2-test.jar TestDFSIO -read -nrFiles 10 -fileSize 1000
> >>
> >> There are additional tests specific to MapReduce; run "hadoop jar hadoop-0.20.2-test.jar" for the complete list.
> >>
> >> 45 minutes for mapping 6GB on 5 nodes is way too high, assuming your gain/offset conversion is a simple algebraic manipulation.
> >>
> >> It takes less than 5 minutes to run a simple mapper (using streaming) on a 4-node cluster on something like 10GB; the mapper I used was an awk command extracting a <key:value> pair from a log (no reducer).
> >>
> >> Thanks
> >> Alex
> >>
> >> On Mon, Apr 12, 2010 at 1:53 PM, Todd Lipcon <t...@cloudera.com> wrote:
> >> Hi Andrew,
> >>
> >> Do you need the sorting behavior that having an identity reducer gives you?
> >> If not, set the number of reduce tasks to 0 and you'll end up with a map-only
> >> job, which should be significantly faster.
> >>
> >> -Todd
> >>
> >> On Mon, Apr 12, 2010 at 9:43 AM, Andrew Nguyen <andrew-lists-had...@ucsfcti.org> wrote:
> >>
> >>> Hello,
> >>>
> >>> I recently set up a 5-node cluster (1 master, 4 slaves) and am looking to use it to process high volumes of patient physiologic data. As an initial exercise to gain a better understanding, I have attempted to run the following problem (which isn't the type of problem that Hadoop was really designed for, as is my understanding).
> >>>
> >>> I have a 6G data file that contains key/value pairs of <sample number, sample value>. I'd like to convert the values based on a gain/offset to their physical units.
> >>> I've set up a MapReduce job using streaming where the mapper does the conversion and the reducer is just an identity reducer. Based on other threads on the mailing list, my initial results are consistent in that it takes considerably more time to process this in Hadoop than on my MacBook Pro (45 minutes vs. 13 minutes). The input is a single 6G file, and it looks like the file is being split into 101 map tasks. This is consistent with the 64M block size.
> >>>
> >>> So my questions are:
> >>>
> >>> * Would it help to increase the block size to 128M? Or decrease the block size? What are some key factors to think about with this question?
> >>> * Are there any other optimizations that I could employ? I have looked into LzoCompression, but I'd like to still work without compression since the single-threaded job that I'm comparing to doesn't use any sort of compression. I know I'm comparing apples to pears a little here, so please feel free to correct this assumption.
> >>> * Is Hadoop really only good for jobs where the data doesn't fit on a single node? At some level, I assume that it can still speed up jobs that do fit on one node, if only because you are performing tasks in parallel.
> >>>
> >>> Thanks!
> >>>
> >>> --Andrew
> >>
> >> --
> >> Todd Lipcon
> >> Software Engineer, Cloudera
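[Editor's note] The gain/offset streaming mapper discussed in this thread can be sketched as a minimal Hadoop Streaming script. This is an illustration only: the tab-separated record format and the GAIN/OFFSET constants are assumptions, not Andrew's actual code or calibration values.

```python
#!/usr/bin/env python
# Hadoop Streaming mapper sketch: applies a gain/offset calibration to each
# <sample number, sample value> record read from stdin and re-emits the pair.
# GAIN, OFFSET, and the tab-separated format are hypothetical placeholders.
import sys

GAIN = 0.5     # hypothetical calibration gain
OFFSET = -2.0  # hypothetical calibration offset

def convert(line, gain=GAIN, offset=OFFSET):
    """Turn one 'key<TAB>raw_value' line into 'key<TAB>physical_value'."""
    key, raw = line.rstrip("\n").split("\t", 1)
    return "%s\t%s" % (key, float(raw) * gain + offset)

if __name__ == "__main__":
    for line in sys.stdin:
        print(convert(line))
```

Following Todd's advice, a script like this would run as a map-only job by passing the mapper to the streaming jar with the number of reduce tasks set to 0 (check the exact jar name and options against the local 0.20 install).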
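[Editor's note] Two figures in the thread can be sanity-checked with simple arithmetic: a ~6 GB input at the default 64 MB block size yields on the order of 100 splits (Andrew saw 101; his file is presumably slightly over 6 GB), and the 100 Mbps NICs (per Andrew's correction above) cap any single link at 12.5 MB/s before protocol and replication overhead, which is consistent with the low TestDFSIO write throughput. A small sketch of that arithmetic:

```python
# Back-of-the-envelope checks on figures quoted in this thread.
GB = 1024 ** 3
MB = 1024 ** 2

# ~6 GB input / 64 MB blocks -> roughly 100 splits, hence ~100 map tasks.
input_bytes = 6 * GB
block_bytes = 64 * MB
num_splits = -(-input_bytes // block_bytes)  # ceiling division
print(num_splits)  # 96 for exactly 6 GB

# A 100 Mbps NIC moves at most 12.5 MB/s in one direction, and with HDFS's
# default replication of 3 each written block also crosses the network to
# two additional nodes, so measured write throughput sits well below this.
nic_mbps = 100
link_mb_per_s = nic_mbps / 8.0
print(link_mb_per_s)  # 12.5
```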