The part nomenclature does not refer to splits.  It identifies which reduce
task wrote that particular output file.  Files are split at read time as
necessary.

You will get more of them if you have more reducers.
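
A minimal sketch of how that looks in a job driver, using the old
org.apache.hadoop.mapred API that was current at the time (the class name
and HDFS paths below are made up for illustration, and the exact setters
for input/output paths shifted between early releases).  The call to
setNumReduceTasks() is what decides how many part-NNNNN files end up in
the output directory:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class CatalogCrawlDriver {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(CatalogCrawlDriver.class);
        conf.setJobName("catalog-crawl");

        // Plain text input is splittable, so the map side parallelizes on
        // its own once the file is large enough to span several splits.
        FileInputFormat.setInputPaths(conf, new Path("/user/bear/input"));
        FileOutputFormat.setOutputPath(conf, new Path("/user/bear/output"));

        // No mapper/reducer classes are set, so the identity classes run;
        // TextInputFormat supplies LongWritable offsets and Text lines.
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);

        // Four reducers -> four output files: part-00000 .. part-00003.
        // With the default single reducer you only ever see part-00000.
        conf.setNumReduceTasks(4);

        JobClient.runJob(conf);
      }
    }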


On 1/16/08 8:25 AM, "Jim the Standing Bear" <[EMAIL PROTECTED]> wrote:

> Thanks Ted.  I just didn't ask it right.  Here is a stupid 101
> question; I am sure the answer lies in the documentation somewhere,
> but I had some difficulty finding it...
> 
> When I do an "ls" on the dfs, I see this:
> /user/bear/output/part-00000 <r 4>
> 
> I probably got confused about what part-##### means... I thought
> part-##### tells how many splits a file has... so far, I have only
> seen part-00000.  When will there be part-00001, 00002, etc.?
> 
> 
> 
> On Jan 16, 2008 11:04 AM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>> 
>> 
>> Parallelizing the processing of data occurs in two steps.  The first is
>> during the map phase, where the input data file is (hopefully) split across
>> multiple tasks.  This should happen transparently most of the time, unless
>> you have a perverse data format or use unsplittable compression on your
>> file.
>> 
>> This parallelism can occur whether you have one input file or many.
>> 
>> The second level of parallelism is at the reduce phase.  You set this by
>> setting the number of reducers.  This also determines the number of output
>> files that you get.
>> 
>> Depending on your algorithm, it may help or hurt to have one or many
>> reducers.  The recent example of a program to find the 10 largest elements
>> pretty much requires a single reducer.  Other programs where the mapper
>> produces huge amounts of output would be better served by having many
>> reducers.
>> 
>> This is a general answer since the question is kind of non-specific.
>> 
>> 
>> 
>> On 1/16/08 7:59 AM, "Jim the Standing Bear" <[EMAIL PROTECTED]> wrote:
>> 
>>> Hi,
>>> 
>>> How do I make Hadoop split its output?  The program I am writing
>>> crawls a catalog tree from a single URL, so initially the input
>>> contains only one entry.  After a few iterations, it will have tens of
>>> thousands of URLs.  But what I noticed is that the output is always a
>>> single file (part-00000).  What I would like is for the job to be
>>> parallelized once the number of entries increases.  Currently that
>>> doesn't seem to be the case.
>> 
>> 
> 
> 
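
On the map-side point in the reply quoted above, here is a rough sketch of
the knobs that control splitting (old org.apache.hadoop.mapred API again;
the property name is as of the 0.x releases, the sizes are illustrative,
and defaults vary by release).  The number of map tasks comes from the
splits the InputFormat computes, not from the number of input files:

    import org.apache.hadoop.mapred.JobConf;

    public class SplitTuning {
      public static void main(String[] args) {
        JobConf conf = new JobConf(SplitTuning.class);

        // Only a hint: FileInputFormat divides the total input size by this
        // number to get a goal split size, then clamps the real split size
        // to max(mapred.min.split.size, min(goal size, HDFS block size)).
        conf.setNumMapTasks(10);

        // Raising the minimum split size is the usual way to force fewer,
        // larger map tasks.
        conf.set("mapred.min.split.size", String.valueOf(64L * 1024 * 1024));

        // Caveat from the reply above: a gzip-compressed input file is not
        // splittable, so it becomes a single map task no matter how large
        // it is or what these settings say.
      }
    }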
