Re: Optimal setup for a test problem
Correction, they are 100Mbps NICs... iperf shows that we're getting about 95 Mbits/sec from one node to another.

On Apr 12, 2010, at 1:05 PM, Andrew Nguyen wrote:

@Todd: I do need the sorting behavior, eventually. However, I'll try it with zero reduce jobs to see.

@Alex: Yes, I was planning on incrementally building my mapper and reducer functions, so currently the mapper takes the value, multiplies by the gain, adds the offset, and outputs a new key/value pair.

Started to run the tests but didn't know how long they should take with the parameters you listed below. However, it seemed like there was no progress being made. Ran it with increasing parameter values; results are included below.

Here is a run with nrFiles 1 and fileSize 10:

had...@cluster-1:/usr/lib/hadoop$ hadoop jar hadoop-0.20.2+228-test.jar TestDFSIO -write -nrFiles 1 -fileSize 10
TestFDSIO.0.0.4
10/04/12 11:57:18 INFO mapred.FileInputFormat: nrFiles = 1
10/04/12 11:57:18 INFO mapred.FileInputFormat: fileSize (MB) = 10
10/04/12 11:57:18 INFO mapred.FileInputFormat: bufferSize = 100
10/04/12 11:57:18 INFO mapred.FileInputFormat: creating control file: 10 mega bytes, 1 files
10/04/12 11:57:19 INFO mapred.FileInputFormat: created control files for: 1 files
10/04/12 11:57:19 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
10/04/12 11:57:19 INFO mapred.FileInputFormat: Total input paths to process : 1
10/04/12 11:57:19 INFO mapred.JobClient: Running job: job_20100407_0017
10/04/12 11:57:20 INFO mapred.JobClient: map 0% reduce 0%
10/04/12 11:57:27 INFO mapred.JobClient: map 100% reduce 0%
10/04/12 11:57:39 INFO mapred.JobClient: map 100% reduce 100%
10/04/12 11:57:41 INFO mapred.JobClient: Job complete: job_20100407_0017
10/04/12 11:57:41 INFO mapred.JobClient: Counters: 18
10/04/12 11:57:41 INFO mapred.JobClient:   Job Counters
10/04/12 11:57:41 INFO mapred.JobClient:     Launched reduce tasks=1
10/04/12 11:57:41 INFO mapred.JobClient:     Launched map tasks=1
10/04/12 11:57:41 INFO mapred.JobClient:     Data-local map tasks=1
10/04/12 11:57:41 INFO mapred.JobClient:   FileSystemCounters
10/04/12 11:57:41 INFO mapred.JobClient:     FILE_BYTES_READ=98
10/04/12 11:57:41 INFO mapred.JobClient:     HDFS_BYTES_READ=113
10/04/12 11:57:41 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=228
10/04/12 11:57:41 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=10485832
10/04/12 11:57:41 INFO mapred.JobClient:   Map-Reduce Framework
10/04/12 11:57:41 INFO mapred.JobClient:     Reduce input groups=5
10/04/12 11:57:41 INFO mapred.JobClient:     Combine output records=0
10/04/12 11:57:41 INFO mapred.JobClient:     Map input records=1
10/04/12 11:57:41 INFO mapred.JobClient:     Reduce shuffle bytes=0
10/04/12 11:57:41 INFO mapred.JobClient:     Reduce output records=5
10/04/12 11:57:41 INFO mapred.JobClient:     Spilled Records=10
10/04/12 11:57:41 INFO mapred.JobClient:     Map output bytes=82
10/04/12 11:57:41 INFO mapred.JobClient:     Map input bytes=27
10/04/12 11:57:41 INFO mapred.JobClient:     Combine input records=0
10/04/12 11:57:41 INFO mapred.JobClient:     Map output records=5
10/04/12 11:57:41 INFO mapred.JobClient:     Reduce input records=5
10/04/12 11:57:41 INFO mapred.FileInputFormat: ----- TestDFSIO ----- : write
10/04/12 11:57:41 INFO mapred.FileInputFormat:            Date & time: Mon Apr 12 11:57:41 PST 2010
10/04/12 11:57:41 INFO mapred.FileInputFormat:        Number of files: 1
10/04/12 11:57:41 INFO mapred.FileInputFormat: Total MBytes processed: 10
10/04/12 11:57:41 INFO mapred.FileInputFormat:      Throughput mb/sec: 8.710801393728223
10/04/12 11:57:41 INFO mapred.FileInputFormat: Average IO rate mb/sec: 8.710801124572754
10/04/12 11:57:41 INFO mapred.FileInputFormat:  IO rate std deviation: 0.0017763302275007867
10/04/12 11:57:41 INFO mapred.FileInputFormat:     Test exec time sec: 22.757
10/04/12 11:57:41 INFO mapred.FileInputFormat:

Here is a run with nrFiles 10 and fileSize 100:

had...@cluster-1:/usr/lib/hadoop$ hadoop jar hadoop-0.20.2+228-test.jar TestDFSIO -write -nrFiles 10 -fileSize 100
TestFDSIO.0.0.4
10/04/12 11:58:54 INFO mapred.FileInputFormat: nrFiles = 10
10/04/12 11:58:54 INFO mapred.FileInputFormat: fileSize (MB) = 100
10/04/12 11:58:54 INFO mapred.FileInputFormat: bufferSize = 100
10/04/12 11:58:54 INFO mapred.FileInputFormat: creating control file: 100 mega bytes, 10 files
10/04/12 11:58:55 INFO mapred.FileInputFormat: created control files for: 10 files
10/04/12 11:58:55 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
10/04/12 11:58:55 INFO mapred.FileInputFormat: Total input paths to process : 10
10/04/12 11:58:55 INFO mapred.JobClient: Running job: job_20100407_0018
10/04/12 11:58:56 INFO mapred.JobClient:
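For reference, the iperf measurement above can be reproduced with a plain server/client pair. This is a typical invocation, with hostnames as placeholders (not necessarily the exact flags Andrew used):

  # on one node, start the iperf server
  iperf -s

  # on another node, run a TCP throughput test against it for 30 seconds
  iperf -c cluster-1 -t 30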
Re: Optimal setup for a test problem
Andrew, here are some tips for Hadoop runtime config: http://cloudepr.blogspot.com/2009/09/cluster-facilities-hardware-and.html

Also, here are some results from my cluster (1GbE NICs, fiber): Dell 5500, 24GB RAM, 8-core (16 hypervised), JBOD. I saw slightly better numbers on a different 4-node cluster with HP G5s.

----- TestDFSIO ----- : write
           Date & time: Wed Mar 31 02:28:59 EDT 2010
       Number of files: 10
Total MBytes processed: 1
     Throughput mb/sec: 5.615639781416837
Average IO rate mb/sec: 5.631219863891602
 IO rate std deviation: 0.2928237500022612
    Test exec time sec: 219.095

----- TestDFSIO ----- : read
           Date & time: Wed Mar 31 02:32:21 EDT 2010
       Number of files: 10
Total MBytes processed: 1
     Throughput mb/sec: 10.662958800459787
Average IO rate mb/sec: 13.391314506530762
 IO rate std deviation: 8.181072283752508
    Test exec time sec: 157.752

thanks
Alex

On Mon, Apr 12, 2010 at 4:19 PM, Andrew Nguyen and...@ucsfcti.org wrote:

Correction, they are 100Mbps NICs... iperf shows that we're getting about 95 Mbits/sec from one node to another.
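A note on reading the TestDFSIO numbers in this thread, assuming the benchmark's usual definitions (worth verifying against the TestDFSIO source for your version): Throughput mb/sec is the total megabytes processed divided by the sum of the per-task IO times, while Average IO rate mb/sec is the unweighted mean of each task's individual rate. When the two agree closely and the standard deviation is small, as in the write run above, the tasks ran at a uniform speed; a large deviation, as in the read run, means some tasks were much slower than others.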
Re: Optimal setup for a test problem
Good to know... The problem is that I'm in an academic environment that needs a lot of convincing regarding new computational technologies. I need to show proven benefit before getting the funds to actually implement anything. These servers were the best I could come up with for this proof-of-concept.

I changed some settings on the nodes and have been experimenting, and I'm seeing about 3.4 mb/sec with TestDFSIO, which is pretty consistent with your observations below. Given that, would increasing the block size help my performance? That should result in fewer map tasks and keep the computation local for longer...? I just need to show that the numbers are better than a single machine, even if sacrificing redundancy (or other factors) in the current setup.

@Alex: Thanks for the links; they give me another bit of evidence to convince those controlling the money flow...

--Andrew

On Tue, 13 Apr 2010 10:29:06 -0700, Todd Lipcon t...@cloudera.com wrote:

On Mon, Apr 12, 2010 at 1:45 PM, Andrew Nguyen andrew-lists-had...@ucsfcti.org wrote:

I don't think you can :-). Sorry, they are 100Mbps NICs... I get 95 Mbit/sec from one node to another with iperf. Should I still be expecting such dismal performance with just 100Mbps?

Yes - in my experience on gigabit, when lots of transfers are going between the nodes, TCP performance actually drops to around half the network capacity. In the case of 100Mbps, this is probably going to be around 5MB/sec. So when you're writing output at 3x replication, it's going to be very, very slow on this network.

-Todd
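To put rough numbers behind Todd's estimate (back-of-the-envelope arithmetic, not a measurement): 100 Mbps is about 12.5 MB/s raw; if concurrent transfers cut that roughly in half, the usable link is ~6 MB/s; and with 3x replication each written block is also forwarded through the replication pipeline to two more nodes over the same links, so a client-visible write rate in the 3-4 MB/s range, like the 3.4 mb/sec observed above, is about what this network can deliver.

On the block-size question, a minimal sketch of the change, assuming a 0.20-era setup (the 128 MB value is illustrative, and dfs.block.size is the pre-0.21 property name). Existing files keep the block size they were written with, so the input would need to be re-uploaded afterwards:

  <!-- hdfs-site.xml: default block size for newly written files -->
  <property>
    <name>dfs.block.size</name>
    <value>134217728</value> <!-- 128 MB; the default is 64 MB (67108864) -->
  </property>

Alternatively, since the file system shell accepts generic options, a single file can be re-uploaded with a larger block size without touching the cluster config (file names here are placeholders):

  hadoop fs -D dfs.block.size=134217728 -put samples.txt /data/samples.txt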
Optimal setup for a test problem
Hello,

I recently set up a 5-node cluster (1 master, 4 slaves) and am looking to use it to process high volumes of patient physiologic data. As an initial exercise to gain a better understanding, I have attempted to run the following problem (which isn't the type of problem that Hadoop was really designed for, as is my understanding).

I have a 6G data file that contains key/value pairs of sample number, sample value. I'd like to convert the values based on a gain/offset to their physical units. I've set up a MapReduce job using streaming where the mapper does the conversion, and the reducer is just an identity reducer.

Based on other threads on the mailing list, my initial results are consistent in that it takes considerably more time to process this in Hadoop than on my MacBook Pro (45 minutes vs. 13 minutes). The input is a single 6G file and it looks like the file is being split into 101 map tasks. This is consistent with the 64M block size.

So my questions are:

* Would it help to increase the block size to 128M? Or decrease the block size? What are some key factors to think about with this question?
* Are there any other optimizations that I could employ? I have looked into LzoCompression but I'd like to still work without compression, since the single-threaded job that I'm comparing to doesn't use any sort of compression. I know I'm comparing apples to pears a little here, so please feel free to correct this assumption.
* Is Hadoop really only good for jobs where the data doesn't fit on a single node? At some level, I assume that it can still speed up jobs that do fit on one node, if only because you are performing tasks in parallel.

Thanks!

--Andrew
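A minimal streaming sketch of the job described above, for reference; the gain/offset constants, paths, and jar location are placeholders rather than the poster's actual values, and -numReduceTasks 0 makes it a map-only job (as suggested elsewhere in the thread), skipping the sort and shuffle entirely:

  #!/bin/sh
  # convert.sh - streaming mapper; 0.5 and 1.25 stand in for the real gain/offset.
  # Input lines look like "<sample_number> <sample_value>"; output is the same
  # sample number with the value converted to physical units.
  exec awk '{ print $1 "\t" ($2 * 0.5 + 1.25) }'

  hadoop jar contrib/streaming/hadoop-0.20.2-streaming.jar \
      -input /data/samples -output /data/converted \
      -mapper convert.sh -file convert.sh \
      -numReduceTasks 0

Shipping the mapper as a file with -file sidesteps the shell-quoting headaches of passing an inline awk program to -mapper.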
Re: Optimal setup for a test problem
Andrew, I would also suggest running the DFSIO benchmark to isolate IO-related issues:

hadoop jar hadoop-0.20.2-test.jar TestDFSIO -write -nrFiles 10 -fileSize 1000
hadoop jar hadoop-0.20.2-test.jar TestDFSIO -read -nrFiles 10 -fileSize 1000

There are additional tests specific to MapReduce - run "hadoop jar hadoop-0.20.2-test.jar" for the complete list.

45 min for mapping 6GB on 5 nodes is way too high. Assuming your gain/offset conversion is a simple algebraic manipulation, it takes less than 5 min to run a simple mapper (using streaming) on a 4-node cluster on something like 10GB. The mapper I used was an awk command extracting a key:value pair from a log (no reducer).

Thanks
Alex

On Mon, Apr 12, 2010 at 1:53 PM, Todd Lipcon t...@cloudera.com wrote:

Hi Andrew,

Do you need the sorting behavior that having an identity reducer gives you? If not, set the number of reduce tasks to 0 and you'll end up with a map-only job, which should be significantly faster.

-Todd
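One practical addition, assuming the standard options in this version's TestDFSIO: the benchmark leaves its test files sitting in HDFS, so clearing them between runs keeps results comparable and frees the space:

  hadoop jar hadoop-0.20.2-test.jar TestDFSIO -clean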
Re: Optimal setup for a test problem
I don't think you can :-). Sorry, they are 100Mbps NICs... I get 95 Mbit/sec from one node to another with iperf. Should I still be expecting such dismal performance with just 100Mbps?

On Apr 12, 2010, at 1:31 PM, Todd Lipcon wrote:

On Mon, Apr 12, 2010 at 1:05 PM, Andrew Nguyen andrew-lists-had...@ucsfcti.org wrote:

5 identically spec'ed nodes, each has:
2 GB RAM
Pentium 4 3.0G with HT
250GB HDD on PATA
10Mbps NIC

This is probably your issue - a 10Mbps NIC? I didn't know you could even get those anymore! Hadoop runs on commodity hardware, but you're not likely to get reasonable performance with hardware like that.

-Todd
Re: Optimal setup for a test problem
I guess my question below can be rephrased as: what are the absolute minimum hardware requirements for me to still see better-than-a-single-machine performance? Thanks!

On Apr 12, 2010, at 1:45 PM, Andrew Nguyen wrote:

I don't think you can :-). Sorry, they are 100Mbps NICs... I get 95 Mbit/sec from one node to another with iperf. Should I still be expecting such dismal performance with just 100Mbps?