Hi Andrew,

Do you need the sorting behavior that an identity reducer gives you? If not, set the number of reduce tasks to 0 and you'll end up with a map-only job, which should be significantly faster because it skips the shuffle and sort between map and reduce entirely.
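For what it's worth, here is a rough sketch of what the mapper side of a map-only streaming job might look like. It assumes tab-separated <sample number, sample value> lines and a linear conversion (physical = raw * gain + offset); GAIN, OFFSET, and the function names are made up for illustration, since the actual calibration isn't given in this thread. With streaming, the reduce count can be set to zero by passing -D mapred.reduce.tasks=0 (or -numReduceTasks 0) on the command line.

```python
import io
import sys

# Hypothetical calibration constants; substitute the real gain/offset.
GAIN = 0.5
OFFSET = -10.0


def convert_line(line, gain=GAIN, offset=OFFSET):
    """Map 'sample_number<TAB>raw_value' to 'sample_number<TAB>physical_value'.

    Assumes a linear calibration: physical = raw * gain + offset.
    """
    key, raw = line.rstrip("\n").split("\t", 1)
    physical = float(raw) * gain + offset
    return "%s\t%s" % (key, physical)


def map_stream(infile, outfile):
    """Convert every non-blank input line and write the result to outfile."""
    for line in infile:
        if line.strip():
            outfile.write(convert_line(line) + "\n")


# Small in-memory demonstration so the sketch runs without a cluster.
demo_in = io.StringIO("1\t100\n2\t120\n")
demo_out = io.StringIO()
map_stream(demo_in, demo_out)
sys.stdout.write(demo_out.getvalue())
```

To use this as the actual streaming mapper, replace the demonstration at the bottom with a call to map_stream(sys.stdin, sys.stdout). With zero reducers, each map task writes its output straight to the job's output directory, so the records come out unsorted.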
-Todd

On Mon, Apr 12, 2010 at 9:43 AM, Andrew Nguyen <andrew-lists-had...@ucsfcti.org> wrote:

> Hello,
>
> I recently set up a 5 node cluster (1 master, 4 slaves) and am looking to
> use it to process high volumes of patient physiologic data. As an initial
> exercise to gain a better understanding, I have attempted to run the
> following problem (which, as I understand it, isn't the type of problem
> that Hadoop was really designed for).
>
> I have a 6G data file that contains key/value pairs of <sample number,
> sample value>. I'd like to convert the values based on a gain/offset to
> their physical units. I've set up a MapReduce job using streaming where the
> mapper does the conversion, and the reducer is just an identity reducer.
> Consistent with other threads on the mailing list, my initial results show
> that it takes considerably more time to process this in Hadoop than it
> does on my MacBook Pro (45 minutes vs. 13 minutes). The input is a single 6G
> file and it looks like the file is being split into 101 map tasks. This is
> consistent with the 64M block size.
>
> So my questions are:
>
> * Would it help to increase the block size to 128M? Or, decrease the block
> size? What are some key factors to think about with this question?
> * Are there any other optimizations that I could employ? I have looked
> into LzoCompression but I'd like to still work without compression since the
> single-threaded job that I'm comparing to doesn't use any sort of compression.
> I know I'm comparing apples to pears a little here so please feel free to
> correct this assumption.
> * Is Hadoop really only good for jobs where the data doesn't fit on a
> single node? At some level, I assume that it can still speed up jobs that do
> fit on one node, if only because you are performing tasks in parallel.
>
> Thanks!
>
> --Andrew

--
Todd Lipcon
Software Engineer, Cloudera