Hi,
I’m trying to model an embarrassingly parallel problem as a map-reduce job.
The amount of data is small -- about 100MB per job, and about 0.25MB per work
item -- but the reduce phase is very CPU-intensive, requiring about 30 seconds
to reduce each mapper's output to a single value.

The 100MB is very small, so the overhead of putting the data in HDFS is also
very small. Does it even make sense to optimize this? (Reading/writing will
only take a second or so.) If you don't want to stream data to HDFS and you
have very little data, then you should look into alternative high ...
You should consider writing a custom InputFormat which reads directly from
the database. While FileInputFormat is the most common base class for an
InputFormat, nothing in the InputFormat contract, or in its critical method
getSplits, requires HDFS. A custom version can return database entries as
input records.
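A rough sketch of what that could look like (using the newer
org.apache.hadoop.mapreduce API; the id-range splits and the WorkItemStore
client are made up here, stand-ins for however you actually address and fetch
your work items):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class DatabaseInputFormat extends InputFormat<Text, BytesWritable> {

  /** Placeholder for the real database client. */
  static class WorkItemStore {
    static byte[] fetch(long id) { return new byte[0]; /* real lookup here */ }
  }

  /** A split is just a range of work-item ids; no file, no HDFS path. */
  public static class IdRangeSplit extends InputSplit implements Writable {
    long firstId, lastId;

    public IdRangeSplit() {}                          // needed for deserialization
    public IdRangeSplit(long first, long last) { firstId = first; lastId = last; }

    @Override public long getLength()        { return lastId - firstId + 1; }  // rough size hint
    @Override public String[] getLocations() { return new String[0]; }         // no locality

    @Override public void write(DataOutput out) throws IOException { out.writeLong(firstId); out.writeLong(lastId); }
    @Override public void readFields(DataInput in) throws IOException { firstId = in.readLong(); lastId = in.readLong(); }
  }

  @Override
  public List<InputSplit> getSplits(JobContext context) {
    // Nothing here touches HDFS; we only decide how to partition the id space.
    long total    = context.getConfiguration().getLong("workitems.total", 0);
    long perSplit = context.getConfiguration().getLong("workitems.per.split", 400);
    List<InputSplit> splits = new ArrayList<>();
    for (long start = 0; start < total; start += perSplit) {
      splits.add(new IdRangeSplit(start, Math.min(start + perSplit, total) - 1));
    }
    return splits;
  }

  @Override
  public RecordReader<Text, BytesWritable> createRecordReader(InputSplit split, TaskAttemptContext ctx) {
    return new RecordReader<Text, BytesWritable>() {
      private IdRangeSplit range;
      private long current;
      private final BytesWritable value = new BytesWritable();

      @Override public void initialize(InputSplit s, TaskAttemptContext c) {
        range = (IdRangeSplit) s;
        current = range.firstId - 1;
      }
      @Override public boolean nextKeyValue() {
        if (current >= range.lastId) return false;
        current++;
        // Pull the ~0.25MB payload for this work item straight from the database.
        byte[] payload = WorkItemStore.fetch(current);
        value.set(payload, 0, payload.length);
        return true;
      }
      @Override public Text getCurrentKey()            { return new Text(Long.toString(current)); }
      @Override public BytesWritable getCurrentValue() { return value; }
      @Override public float getProgress() {
        return (float) (current - range.firstId + 1) / (range.lastId - range.firstId + 1);
      }
      @Override public void close() {}
    };
  }
}
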
You’re right, 100MB is small, but if there are 100,000 jobs, the overhead of
copying data to HDFS adds up. I guess my main concern was whether allowing
mappers to fetch the input data would violate some technical rule or map-reduce
principle.
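Concretely, I was picturing something along these lines (WorkItemStore here is
just a stand-in for whatever client we would actually use, not a real API), where
the job input is only a list of work-item ids and each mapper pulls its ~0.25MB
payload itself:

import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input lines contain only work-item ids; the mapper fetches the payload itself,
// so the actual data never passes through HDFS.
public class FetchingMapper extends Mapper<LongWritable, Text, Text, BytesWritable> {

  /** Placeholder for the real database/service client. */
  static class WorkItemStore {
    static byte[] fetch(long id) { return new byte[0]; /* real lookup here */ }
  }

  @Override
  protected void map(LongWritable offset, Text workItemId, Context context)
      throws IOException, InterruptedException {
    byte[] payload = WorkItemStore.fetch(Long.parseLong(workItemId.toString().trim()));
    context.write(workItemId, new BytesWritable(payload));
  }
}
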
I have considered alternative solutions like ...
Ah, yes, I remember reading about custom InputFormats but did not realize they
could bypass HDFS entirely. Sounds like a good solution; I will look into it.
Thanks,
Trevor