Hi,

I’m trying to model an "embarrassingly parallel" problem as a map-reduce job. 
The amount of data is small -- about 100MB per job, and about 0.25MB per work 
item -- but the reduce phase is very CPU-intensive, requiring about 30 seconds 
to reduce each mapper's output to a single value. The goal is to speed up the 
computation by distributing the tasks across many machines.

I am not sure how the mappers would work in this scenario. My initial thought 
was that there would be one mapper per reducer, and each mapper would fetch its 
input directly from the source database, using an input key provided by Hadoop. 
(Remember, it’s only about 0.25MB per work item.) It would then do whatever 
fix-up and massaging is needed to prepare the data for the reduce phase.
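
Concretely, here is a rough sketch of what I’m picturing for the mapper (just a 
sketch: the class name, the database call, and the assumption that the input 
value is a work-item key read from a small key file are all placeholders of 
mine, not anything from a real setup):

import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch only: each map() call handles one ~0.25MB work item, fetched
// directly from the source database rather than from HDFS.
public class FetchingMapper
        extends Mapper<LongWritable, Text, Text, BytesWritable> {

    @Override
    protected void map(LongWritable offset, Text workItemKey, Context context)
            throws IOException, InterruptedException {
        String key = workItemKey.toString().trim();

        // Pull the work item straight from the source database.
        byte[] raw = fetchFromSourceDb(key);

        // Fix up / massage the data so it is ready for the reduce phase.
        byte[] prepared = massage(raw);

        // Emit under the work-item key so one reducer gets one item's data.
        context.write(new Text(key), new BytesWritable(prepared));
    }

    // Placeholder: the real database client call would go here.
    private byte[] fetchFromSourceDb(String key) throws IOException {
        throw new UnsupportedOperationException("hook up the real client");
    }

    // Placeholder for the fix-up/massaging step.
    private byte[] massage(byte[] raw) {
        return raw;
    }
}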

However, none of the tutorials or example code I’ve seen does it this way: they 
always copy the data from the source database into HDFS first. For my use case, 
this seems wasteful. The data per task is very small and can fit entirely in 
the mapper’s and reducer’s main memory, so I don’t need “big data” redundant 
storage. Also, the data is read only once per task, so there’s nothing to be 
gained by the data locality optimizations of HDFS. Having to copy the data to 
an intermediate data store seems unnecessary and just adds overhead in this 
case.
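
To make the alternative concrete: the only thing I’d put in HDFS is a small 
text file of work-item keys, one per line, and drive the job off that. Here is 
a rough sketch of the driver I have in mind (again just a guess on my part; 
NLineInputFormat with one line per split is my assumption for getting one key 
per mapper, and the reducer is left as a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WorkItemJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "fetch-and-crunch");
        job.setJarByClass(WorkItemJob.class);

        // The only HDFS input is a tiny file of work-item keys, one per line.
        // One line per split gives one mapper per work item.
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.addInputPath(job, new Path(args[0]));
        NLineInputFormat.setNumLinesPerSplit(job, 1);

        job.setMapperClass(FetchingMapper.class);
        // Plug in the CPU-heavy (~30s) reduce step here, e.g.:
        // job.setReducerClass(CrunchingReducer.class);
        // job.setNumReduceTasks(numberOfWorkItems);  // one reducer per mapper

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(BytesWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);  // assuming the reducer emits text

        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

At roughly 400 work items per job, that key file would only be a few KB, so I 
wouldn’t really be using HDFS as a data store at all.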

Is it okay to bypass HDFS for certain types of problems, such as this one? Or 
is there some reason mappers should never perform external I/O? I am very new 
to Hadoop, so I don’t have much experience to go on here.

Thank you,

Trevor
