Thanks, Miles.
On Jan 16, 2008 11:51 AM, Miles Osborne <[EMAIL PROTECTED]> wrote:
The number of reduces should be a function of the amount of data needing
reducing, not the number of mappers.
For example, your mappers might delete 90% of the input data, in which
case you would only need about a tenth as many reducers as mappers.
Miles
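For concreteness, a minimal sketch of setting the reduce count with the
old org.apache.hadoop.mapred API (the class name and paths here are made
up; no mapper or reducer class is set, so the identity defaults run, as
in the jobs discussed below):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class ReducerCountExample {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(ReducerCountExample.class);
        conf.setJobName("reducer-count-example");

        // Size the reduce phase to the data the maps emit, not to the
        // number of map tasks: if the maps drop ~90% of the input, far
        // fewer reduces are needed.
        conf.setNumReduceTasks(2);  // output: part-00000, part-00001

        // No mapper/reducer classes set, so the framework falls back
        // to IdentityMapper and IdentityReducer.
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
      }
    }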
On 16/01/2008, Jim the Standing Bear <[EMAIL PROTECTED]> wrote:
hmm.. interesting... these are supposed to be the output from mappers
(and default reducers, since I didn't specify any for those jobs)...
but shouldn't the number of reducers match the number of mappers? If
there were only one reducer, wouldn't that mean I only had one mapper
task running?? That is why
The part nomenclature does not refer to splits. It refers to how many
reduce processes were involved in actually writing the output file. Files
are split at read-time as necessary.
You will get more part files if you have more reducers.
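As an illustration (the path follows the example below; the names assume
the standard part-NNNNN convention), a job run with
conf.setNumReduceTasks(3) leaves one output file per reduce:

    /user/bear/output/part-00000
    /user/bear/output/part-00001
    /user/bear/output/part-00002

The count of part files tracks the number of reduces, not the number of
input splits.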
On 1/16/08 8:25 AM, "Jim the Standing Bear" <[EMAIL PROTECTED]> wrote:
Thanks Ted. I just didn't ask it right. Here is a stupid 101
question; I am sure the answer lies in the documentation somewhere,
just that I was having some difficulty finding it...
when I do an "ls" on the dfs, I would see this:
/user/bear/output/part-0
I probably got confused
Parallelizing the processing of data occurs in two steps. The first is
during the map phase, where the input data file is (hopefully) split
across multiple tasks. This should happen transparently most of the time
unless you have a perverse data format or use unsplittable compression on
your files. The second is during the reduce phase, where the number of
reduce tasks you configure determines how many processes share the work
(and how many part files you get).
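To make the unsplittable case concrete, a small sketch against the old
mapred API (the class name is hypothetical; Hadoop's own TextInputFormat
does essentially this check for compressed files, which is why one large
gzipped input lands on a single map task):

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.TextInputFormat;

    // Hypothetical input format that refuses to split its files, so
    // each file is handled by exactly one map task; the same effect you
    // get implicitly when the input uses a non-splittable codec such as
    // gzip.
    public class WholeFileTextInputFormat extends TextInputFormat {
      @Override
      protected boolean isSplitable(FileSystem fs, Path file) {
        return false;
      }
    }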