Yes.  I am recommending a pre-processing step before the map-reduce program.
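
To make that pre-processing step concrete: one common way to do it (my
assumption here, since the thread doesn't prescribe a format) is to pack the
small local files into a single SequenceFile keyed by file name and write it
straight into HDFS. The paths and directory names below are placeholders.

    import java.io.DataInputStream;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class PackSmallFiles {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);       // HDFS, per the cluster config
        Path out = new Path("/data/packed.seq");    // placeholder output path

        // key = original file name, value = raw bytes of that file
        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, out, Text.class, BytesWritable.class);
        try {
          File[] inputs = new File("/local/small-files").listFiles();  // placeholder dir
          for (File f : inputs) {
            byte[] buf = new byte[(int) f.length()];
            DataInputStream in = new DataInputStream(new FileInputStream(f));
            try {
              in.readFully(buf);
            } finally {
              in.close();
            }
            writer.append(new Text(f.getName()), new BytesWritable(buf));
          }
        } finally {
          writer.close();
        }
      }
    }

The map side then iterates over records in one large, splittable file instead
of opening thousands of tiny ones.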

And yes. They do get split up again.  They also get copied to multiple nodes
so that the reads can proceed in parallel.  The most important effects of
concatenation and importing into HDFS are the parallelism and the fact that
processing reads sequential disk blocks.
 
The number of replicas, the number of large files, and the size of the splits
determine how many map functions you can run in parallel without getting
I/O bound.
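
A rough back-of-the-envelope sketch of that relationship; every number in it
(data size, block size, cluster size, disks per node) is invented purely for
illustration, not taken from the thread.

    // Estimate how many map tasks a data set yields and how many can
    // actually read in parallel before the disks saturate.
    public class SplitMath {
      public static void main(String[] args) {
        long totalBytes   = 100L * 1024 * 1024 * 1024; // 100 GB of packed input (assumed)
        long splitBytes   = 64L * 1024 * 1024;         // 64 MB block/split size (assumed)
        int  replication  = 3;                         // dfs.replication (assumed)
        int  nodes        = 20;                        // cluster size (assumed)
        int  disksPerNode = 4;                         // spindles per node (assumed)

        long splits = (totalBytes + splitBytes - 1) / splitBytes; // one map per split
        long concurrentReaders = (long) nodes * disksPerNode;     // sequential streams the disks sustain

        System.out.println("map tasks (one per split): " + splits);
        System.out.println("maps runnable in parallel before I/O bound: "
            + Math.min(splits, concurrentReaders));
        // Replication does not add map tasks, but it gives the scheduler
        // 'replication' choices of node for each split, which keeps reads
        // local and sequential.
      }
    }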

If you are working on a small problem, then running Hadoop on a single node
works just fine and accessing the local file system works just fine, but if
you can do that, you might as well just write a sequential program in the
first place.  If you have a large problem that requires parallelism, then
reading from a local file system is likely to be a serious bottleneck.
This is particularly true if you are processing your data repeatedly, as is
relatively common when, say, doing log processing of various kinds at
multiple time scales.


On 8/26/07 5:45 PM, "mfc" <[EMAIL PROTECTED]> wrote:

> [concatenation .. Compression]...but then the map/reduce job in HADOOP breaks
> the large files back down
> into small chunks. This is what prompted the question in the first place
> about running Map/Reduce directly on the small files in the local file
> system.
> 
> I'm wondering if doing the conversion to large files and copy into HDFS
> would introduce a lot of overhead that would not be necessary if map/reduce
> could be run directly on the local file system on the small files.
