Hello,

I recently set up a 5-node cluster (1 master, 4 slaves) and am looking to use it 
to process high volumes of patient physiologic data.  As an initial exercise to 
gain a better understanding, I have attempted the following problem (which, as I 
understand it, isn't really the type of problem Hadoop was designed for).

I have a 6G data file that contains key/value pairs of <sample number, sample 
value>.  I'd like to convert the values to their physical units using a 
gain/offset.  I've set up a MapReduce job using streaming where the mapper does 
the conversion and the reducer is just an identity reducer.  Consistent with 
other threads on the mailing list, my initial results show that it takes 
considerably more time to process this in Hadoop than it does on my MacBook Pro 
(45 minutes vs. 13 minutes).  The input is a single 6G file, and it looks like 
it is being split into 101 map tasks, which is consistent with the 64M block 
size.
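
For context, the mapper amounts to something like this (a minimal sketch, 
assuming tab-separated input; the GAIN and OFFSET values below are placeholders, 
the real calibration comes from the recording metadata):

    #!/usr/bin/env python
    # Streaming mapper sketch: reads <sample number>\t<sample value> lines
    # from stdin and emits <sample number>\t<physical value>.
    # GAIN and OFFSET are placeholders, not the real calibration values.
    import sys

    GAIN = 0.003     # placeholder: physical units per raw count
    OFFSET = -512.0  # placeholder: raw-count offset

    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            continue  # skip malformed lines
        sample_num, raw = fields
        physical = (float(raw) + OFFSET) * GAIN  # raw -> physical units
        print("%s\t%.6f" % (sample_num, physical))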

So my questions are:

* Would it help to increase the block size to 128M, or to decrease it?  What 
are the key factors to consider here?  (See the command sketch after this list.)
* Are there any other optimizations I could employ?  I have looked into LZO 
compression, but I'd like to keep working without compression since the 
single-threaded job I'm comparing against doesn't use any compression either.  
I know I'm comparing apples to oranges a little here, so please feel free to 
correct this assumption.
* Is Hadoop really only good for jobs where the data doesn't fit on a single 
node?  At some level, I assume it can still speed up jobs that do fit on one 
node, if only because the tasks run in parallel.
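
For the first question, the plan would be to re-load the file with a larger 
block size, something along these lines (a sketch; the file name and path are 
placeholders, and I believe the property is dfs.block.size on my version, 
with 134217728 bytes = 128M):

    hadoop fs -D dfs.block.size=134217728 -put samples.txt /user/andrew/samples.txt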

Thanks!

--Andrew
