Parallelizing the processing of data happens at two stages.  The first is
the map phase, where the input data file is (hopefully) split across
multiple tasks.  This should happen transparently most of the time unless
you have a perverse data format or use unsplittable compression on your
file.
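Roughly, here is a minimal sketch with the old org.apache.hadoop.mapred
JobConf API (the input path and class name are just placeholders, and the
exact calls vary a bit across releases).  Note that setNumMapTasks() is
only a hint; the real number of map tasks is decided by the input splits
the InputFormat computes:

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.FileInputFormat;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.TextInputFormat;

  public class MapSideSetup {
    public static void configure(JobConf conf) {
      // Plain text is splittable, so even a single large file can fan
      // out into many map tasks; a gzipped file would not split.
      conf.setInputFormat(TextInputFormat.class);
      FileInputFormat.setInputPaths(conf, new Path("/user/me/input"));
      // Only a hint; the split computation decides the actual count.
      conf.setNumMapTasks(10);
    }
  }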

This parallelism can occur whether you have one input file or many.

The second level of parallelism is the reduce phase.  You control this by
setting the number of reducers, which also determines the number of output
files that you get.
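Again as a sketch with the same JobConf API (the class name is made up):

  import org.apache.hadoop.mapred.JobConf;

  public class ReduceSideSetup {
    public static void configure(JobConf conf) {
      // Eight reduce tasks means eight output files,
      // part-00000 through part-00007.
      conf.setNumReduceTasks(8);
    }
  }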

Depending on your algorithm, it may help or hurt to have one reducer or
many.  The recent example of a program to find the 10 largest elements
pretty much requires a single reducer, while programs whose mappers produce
huge amounts of output are better served by many reducers.

This is a general answer since the question is kind of non-specific.


On 1/16/08 7:59 AM, "Jim the Standing Bear" <[EMAIL PROTECTED]> wrote:

> Hi,
> 
> How do I make hadoop split its output?  The program I am writing
> crawls a catalog tree from a single url, so initially the input
> contains only one entry.  After a few iterations, it will have tens of
> thousands of urls.  But what I noticed is that the file is always in
> one block (part-00000).  What I would like is that once the number of
> entries increases, it can parallelize the job.  Currently it doesn't
> seem to be the case.
