Also: http://www.slideshare.net/cloudera/hw09-optimizing-hadoop-deployments
On Tue, Apr 13, 2010 at 12:58 PM, alex kamil <alex.ka...@gmail.com> wrote:

Andrew,

Here are some tips for Hadoop runtime config:
http://cloudepr.blogspot.com/2009/09/cluster-facilities-hardware-and.html

Also, here are some results from my cluster (1 GbE NICs, fiber; Dell 5500, 24 GB RAM, 8-core, 16 with hyperthreading; JBOD). I saw slightly better numbers on a different 4-node cluster with HP G5s.

----- TestDFSIO ----- : write
           Date & time: Wed Mar 31 02:28:59 EDT 2010
       Number of files: 10
Total MBytes processed: 10000
     Throughput mb/sec: 5.615639781416837
Average IO rate mb/sec: 5.631219863891602
 IO rate std deviation: 0.2928237500022612
    Test exec time sec: 219.095

----- TestDFSIO ----- : read
           Date & time: Wed Mar 31 02:32:21 EDT 2010
       Number of files: 10
Total MBytes processed: 10000
     Throughput mb/sec: 10.662958800459787
Average IO rate mb/sec: 13.391314506530762
 IO rate std deviation: 8.181072283752508
    Test exec time sec: 157.752

Thanks
Alex

On Mon, Apr 12, 2010 at 4:19 PM, Andrew Nguyen <and...@ucsfcti.org> wrote:

Correction: they are 100 Mbps NICs. iperf shows that we're getting about 95 Mbits/sec from one node to another.

On Apr 12, 2010, at 1:05 PM, Andrew Nguyen wrote:

@Todd: I do need the sorting behavior, eventually. However, I'll try it with zero reduce tasks to see.

@Alex: Yes, I was planning on incrementally building my mapper and reducer functions, so currently the mapper takes the value, multiplies it by the gain, adds the offset, and outputs a new key/value pair.
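For reference, a minimal sketch of such a streaming mapper, assuming tab-separated <sample number, sample value> input lines; the gain/offset constants, HDFS paths, and jar location below are illustrative placeholders, not Andrew's actual setup:

    # convert.awk: scale each raw sample into physical units
    # (gain and offset are made-up example values)
    BEGIN { FS = OFS = "\t"; gain = 2.5; offset = 100.0 }
    { print $1, $2 * gain + offset }

    # Run it with an identity reducer (cat), shipping the script to the task nodes:
    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*streaming*.jar \
        -input  /user/andrew/samples \
        -output /user/andrew/samples-converted \
        -mapper 'awk -f convert.awk' \
        -file   convert.awk \
        -reducer /bin/cat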
I started to run the tests but didn't know how long they should take with the parameters you listed below, and it seemed like no progress was being made. I ran again with increasing parameter values; results are included below.

Here is a run with nrFiles 1 and fileSize 10:

had...@cluster-1:/usr/lib/hadoop$ hadoop jar hadoop-0.20.2+228-test.jar TestDFSIO -write -nrFiles 1 -fileSize 10
TestFDSIO.0.0.4
10/04/12 11:57:18 INFO mapred.FileInputFormat: nrFiles = 1
10/04/12 11:57:18 INFO mapred.FileInputFormat: fileSize (MB) = 10
10/04/12 11:57:18 INFO mapred.FileInputFormat: bufferSize = 1000000
10/04/12 11:57:18 INFO mapred.FileInputFormat: creating control file: 10 mega bytes, 1 files
10/04/12 11:57:19 INFO mapred.FileInputFormat: created control files for: 1 files
10/04/12 11:57:19 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
10/04/12 11:57:19 INFO mapred.FileInputFormat: Total input paths to process : 1
10/04/12 11:57:19 INFO mapred.JobClient: Running job: job_201004111107_0017
10/04/12 11:57:20 INFO mapred.JobClient: map 0% reduce 0%
10/04/12 11:57:27 INFO mapred.JobClient: map 100% reduce 0%
10/04/12 11:57:39 INFO mapred.JobClient: map 100% reduce 100%
10/04/12 11:57:41 INFO mapred.JobClient: Job complete: job_201004111107_0017
10/04/12 11:57:41 INFO mapred.JobClient: Counters: 18
10/04/12 11:57:41 INFO mapred.JobClient:   Job Counters
10/04/12 11:57:41 INFO mapred.JobClient:     Launched reduce tasks=1
10/04/12 11:57:41 INFO mapred.JobClient:     Launched map tasks=1
10/04/12 11:57:41 INFO mapred.JobClient:     Data-local map tasks=1
10/04/12 11:57:41 INFO mapred.JobClient:   FileSystemCounters
10/04/12 11:57:41 INFO mapred.JobClient:     FILE_BYTES_READ=98
10/04/12 11:57:41 INFO mapred.JobClient:     HDFS_BYTES_READ=113
10/04/12 11:57:41 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=228
10/04/12 11:57:41 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=10485832
10/04/12 11:57:41 INFO mapred.JobClient:   Map-Reduce Framework
10/04/12 11:57:41 INFO mapred.JobClient:     Reduce input groups=5
10/04/12 11:57:41 INFO mapred.JobClient:     Combine output records=0
10/04/12 11:57:41 INFO mapred.JobClient:     Map input records=1
10/04/12 11:57:41 INFO mapred.JobClient:     Reduce shuffle bytes=0
10/04/12 11:57:41 INFO mapred.JobClient:     Reduce output records=5
10/04/12 11:57:41 INFO mapred.JobClient:     Spilled Records=10
10/04/12 11:57:41 INFO mapred.JobClient:     Map output bytes=82
10/04/12 11:57:41 INFO mapred.JobClient:     Map input bytes=27
10/04/12 11:57:41 INFO mapred.JobClient:     Combine input records=0
10/04/12 11:57:41 INFO mapred.JobClient:     Map output records=5
10/04/12 11:57:41 INFO mapred.JobClient:     Reduce input records=5
10/04/12 11:57:41 INFO mapred.FileInputFormat: ----- TestDFSIO ----- : write
10/04/12 11:57:41 INFO mapred.FileInputFormat:            Date & time: Mon Apr 12 11:57:41 PST 2010
10/04/12 11:57:41 INFO mapred.FileInputFormat:        Number of files: 1
10/04/12 11:57:41 INFO mapred.FileInputFormat: Total MBytes processed: 10
10/04/12 11:57:41 INFO mapred.FileInputFormat:      Throughput mb/sec: 8.710801393728223
10/04/12 11:57:41 INFO mapred.FileInputFormat: Average IO rate mb/sec: 8.710801124572754
10/04/12 11:57:41 INFO mapred.FileInputFormat:  IO rate std deviation: 0.0017763302275007867
10/04/12 11:57:41 INFO mapred.FileInputFormat:     Test exec time sec: 22.757
10/04/12 11:57:41 INFO mapred.FileInputFormat:
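(Side note: TestDFSIO keeps its data under a fixed HDFS working directory, /benchmarks/TestDFSIO by default in the 0.20-era test jar, so stale output can be cleared between runs with "hadoop jar hadoop-0.20.2+228-test.jar TestDFSIO -clean".)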
Here is a run with nrFiles 10 and fileSize 100:

had...@cluster-1:/usr/lib/hadoop$ hadoop jar hadoop-0.20.2+228-test.jar TestDFSIO -write -nrFiles 10 -fileSize 100
TestFDSIO.0.0.4
10/04/12 11:58:54 INFO mapred.FileInputFormat: nrFiles = 10
10/04/12 11:58:54 INFO mapred.FileInputFormat: fileSize (MB) = 100
10/04/12 11:58:54 INFO mapred.FileInputFormat: bufferSize = 1000000
10/04/12 11:58:54 INFO mapred.FileInputFormat: creating control file: 100 mega bytes, 10 files
10/04/12 11:58:55 INFO mapred.FileInputFormat: created control files for: 10 files
10/04/12 11:58:55 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
10/04/12 11:58:55 INFO mapred.FileInputFormat: Total input paths to process : 10
10/04/12 11:58:55 INFO mapred.JobClient: Running job: job_201004111107_0018
10/04/12 11:58:56 INFO mapred.JobClient: map 0% reduce 0%
10/04/12 11:59:45 INFO mapred.JobClient: map 10% reduce 0%
10/04/12 11:59:54 INFO mapred.JobClient: map 10% reduce 3%
10/04/12 11:59:59 INFO mapred.JobClient: map 20% reduce 3%
10/04/12 12:00:01 INFO mapred.JobClient: map 40% reduce 3%
10/04/12 12:00:03 INFO mapred.JobClient: map 50% reduce 3%
10/04/12 12:00:08 INFO mapred.JobClient: map 60% reduce 3%
10/04/12 12:00:09 INFO mapred.JobClient: map 60% reduce 16%
10/04/12 12:00:11 INFO mapred.JobClient: map 70% reduce 16%
10/04/12 12:00:18 INFO mapred.JobClient: map 70% reduce 20%
10/04/12 12:00:23 INFO mapred.JobClient: map 80% reduce 20%
10/04/12 12:00:24 INFO mapred.JobClient: map 80% reduce 23%
10/04/12 12:00:26 INFO mapred.JobClient: map 90% reduce 23%
10/04/12 12:00:30 INFO mapred.JobClient: map 100% reduce 23%
10/04/12 12:00:33 INFO mapred.JobClient: map 100% reduce 26%
10/04/12 12:00:39 INFO mapred.JobClient: map 100% reduce 100%
10/04/12 12:00:41 INFO mapred.JobClient: Job complete: job_201004111107_0018
10/04/12 12:00:41 INFO mapred.JobClient: Counters: 18
10/04/12 12:00:41 INFO mapred.JobClient:   Job Counters
10/04/12 12:00:41 INFO mapred.JobClient:     Launched reduce tasks=1
10/04/12 12:00:41 INFO mapred.JobClient:     Launched map tasks=14
10/04/12 12:00:41 INFO mapred.JobClient:     Data-local map tasks=14
10/04/12 12:00:41 INFO mapred.JobClient:   FileSystemCounters
10/04/12 12:00:41 INFO mapred.JobClient:     FILE_BYTES_READ=961
10/04/12 12:00:41 INFO mapred.JobClient:     HDFS_BYTES_READ=1130
10/04/12 12:00:41 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=2296
10/04/12 12:00:41 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1048576079
10/04/12 12:00:41 INFO mapred.JobClient:   Map-Reduce Framework
10/04/12 12:00:41 INFO mapred.JobClient:     Reduce input groups=5
10/04/12 12:00:41 INFO mapred.JobClient:     Combine output records=0
10/04/12 12:00:41 INFO mapred.JobClient:     Map input records=10
10/04/12 12:00:41 INFO mapred.JobClient:     Reduce shuffle bytes=914
10/04/12 12:00:41 INFO mapred.JobClient:     Reduce output records=5
10/04/12 12:00:41 INFO mapred.JobClient:     Spilled Records=100
10/04/12 12:00:41 INFO mapred.JobClient:     Map output bytes=855
10/04/12 12:00:41 INFO mapred.JobClient:     Map input bytes=270
10/04/12 12:00:41 INFO mapred.JobClient:     Combine input records=0
10/04/12 12:00:41 INFO mapred.JobClient:     Map output records=50
10/04/12 12:00:41 INFO mapred.JobClient:     Reduce input records=50
10/04/12 12:00:41 INFO mapred.FileInputFormat: ----- TestDFSIO ----- : write
10/04/12 12:00:41 INFO mapred.FileInputFormat:            Date & time: Mon Apr 12 12:00:41 PST 2010
10/04/12 12:00:41 INFO mapred.FileInputFormat:        Number of files: 10
10/04/12 12:00:41 INFO mapred.FileInputFormat: Total MBytes processed: 1000
10/04/12 12:00:41 INFO mapred.FileInputFormat:      Throughput mb/sec: 1.9073850132944736
10/04/12 12:00:41 INFO mapred.FileInputFormat: Average IO rate mb/sec: 2.1501593589782715
10/04/12 12:00:41 INFO mapred.FileInputFormat:  IO rate std deviation: 0.8994861001170683
10/04/12 12:00:41 INFO mapred.FileInputFormat:     Test exec time sec: 106.45
10/04/12 12:00:41 INFO mapred.FileInputFormat:

The throughput is a lot lower for 10/100 vs 1/10...
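A quick sanity check on those numbers (assuming HDFS's default replication factor of 3 and the corrected 100 Mbps NICs): 10 concurrent writers at ~1.9 MB/s is ~19 MB/s of aggregate client throughput, and because each block is pipelined to two remote replicas, every byte crosses the network roughly twice. That is ~38 MB/s, or about 300 Mbit/s, shared across five links that iperf caps at ~95 Mbit/s each; the 10-file run is plausibly network-bound rather than evidence of misconfiguration.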
Here are some rough specs of our cluster. 5 identically spec'ed nodes, each has:

2 GB RAM
Pentium 4 3.0 GHz with HT
250 GB HDD on PATA
10 Mbps NIC (corrected above: they are 100 Mbps)

They are on a private network on a Dell switch.

Thanks!

--Andrew

On Apr 12, 2010, at 11:58 AM, alex kamil wrote:

Andrew,

I would also suggest running the DFSIO benchmark to isolate IO-related issues:

hadoop jar hadoop-0.20.2-test.jar TestDFSIO -write -nrFiles 10 -fileSize 1000
hadoop jar hadoop-0.20.2-test.jar TestDFSIO -read -nrFiles 10 -fileSize 1000

There are additional tests specific to MapReduce; run "hadoop jar hadoop-0.20.2-test.jar" for the complete list.

45 min for mapping 6 GB on 5 nodes is way too high, assuming your gain/offset conversion is a simple algebraic manipulation. It takes less than 5 min to run a simple mapper (using streaming) on a 4-node cluster on something like 10 GB; the mapper I used was an awk command extracting <key:value> pairs from a log (no reducer).

Thanks
Alex

On Mon, Apr 12, 2010 at 1:53 PM, Todd Lipcon <t...@cloudera.com> wrote:

Hi Andrew,

Do you need the sorting behavior that having an identity reducer gives you? If not, set the number of reduce tasks to 0 and you'll end up with a map-only job, which should be significantly faster.

-Todd

--
Todd Lipcon
Software Engineer, Cloudera
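For a streaming job, Todd's suggestion amounts to something like the sketch below (paths and jar location are illustrative, as in the earlier sketch). With zero reduces, map output is written straight to HDFS and the sort/shuffle phase is skipped entirely:

    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*streaming*.jar \
        -D mapred.reduce.tasks=0 \
        -input  /user/andrew/samples \
        -output /user/andrew/samples-converted \
        -mapper 'awk -f convert.awk' \
        -file   convert.awk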
On Mon, Apr 12, 2010 at 9:43 AM, Andrew Nguyen <andrew-lists-had...@ucsfcti.org> wrote:

Hello,

I recently set up a 5-node cluster (1 master, 4 slaves) and am looking to use it to process high volumes of patient physiologic data. As an initial exercise to gain a better understanding, I have attempted to run the following problem (which isn't the type of problem that Hadoop was really designed for, as is my understanding).

I have a 6 GB data file that contains key/value pairs of <sample number, sample value>. I'd like to convert the values, based on a gain/offset, to their physical units. I've set up a MapReduce job using streaming where the mapper does the conversion and the reducer is just an identity reducer. Consistent with other threads on the mailing list, my initial results show that it takes considerably more time to process this in Hadoop than on my MacBook Pro (45 minutes vs. 13 minutes). The input is a single 6 GB file, and it looks like the file is being split into 101 map tasks. This is consistent with the 64 MB block size.

So my questions are:

* Would it help to increase the block size to 128 MB? Or decrease it? What are some key factors to think about with this question? (See the note after this message.)
* Are there any other optimizations I could employ? I have looked into LZO compression, but I'd like to keep working without compression, since the single-threaded job I'm comparing against doesn't use any. I know I'm comparing apples to pears a little here, so please feel free to correct this assumption.
* Is Hadoop really only good for jobs where the data doesn't fit on a single node? At some level, I assume it can still speed up jobs that do fit on one node, if only because you are performing tasks in parallel.

Thanks!

--Andrew
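On the block-size question: an HDFS file's block size is fixed when the file is written, so changing it means re-uploading the input. A sketch using the 0.20-era property name dfs.block.size (the paths are illustrative):

    # Re-upload the input with a 128 MB block size (134217728 bytes).
    # The cluster-wide default can instead be set via dfs.block.size in hdfs-site.xml.
    hadoop fs -D dfs.block.size=134217728 -put samples.txt /user/andrew/samples-128m

Larger blocks mean roughly half as many map tasks (~50 instead of 101 for this file), which cuts per-task startup overhead at the cost of coarser-grained parallelism across the 4 worker nodes.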