Re: a question on number of parallel tasks

2008-01-16 Thread Jim the Standing Bear
Thanks, Miles.

On Jan 16, 2008 11:51 AM, Miles Osborne <[EMAIL PROTECTED]> wrote:
> The number of reduces should be a function of the amount of data needing
> reducing, not the number of mappers.
>
> For example, your mappers might delete 90% of the input data, in which
> case you should only need 1/10 as many reducers as mappers.

Re: a question on number of parallel tasks

2008-01-16 Thread Miles Osborne
The number of reduces should be a function of the amount of data needing reducing, not the number of mappers.

For example, your mappers might delete 90% of the input data, in which case you should only need 1/10 as many reducers as mappers.

Miles
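Concretely, the reducer count is set on the job configuration. Below is a minimal sketch against the old org.apache.hadoop.mapred API as it looked around that time; the class name and paths are invented for illustration, and the exact helper methods vary slightly between 0.x releases:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class ReducerCountDemo {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(ReducerCountDemo.class);
        conf.setJobName("reducer-count-demo");

        // Hypothetical input/output paths, purely for illustration.
        FileInputFormat.setInputPaths(conf, new Path("/user/bear/input"));
        FileOutputFormat.setOutputPath(conf, new Path("/user/bear/output"));

        // With no mapper/reducer classes set, the identity defaults apply,
        // so the job runs as-is. If the mappers discarded ~90% of the data,
        // size the reduce phase to the surviving volume rather than to the
        // mapper count:
        conf.setNumReduceTasks(10);

        JobClient.runJob(conf);
      }
    }

Since the identity reducer is the default here, the effect of setNumReduceTasks is easy to observe directly in the output directory's part files.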

Re: a question on number of parallel tasks

2008-01-16 Thread Jim the Standing Bear
Hmm... interesting. These are supposed to be the output from mappers (and the default reducers, since I didn't specify any for those jobs)... but shouldn't the number of reducers match the number of mappers? If there was only one reducer, would that mean I only had one mapper task running? That is why I was asking about the number of parallel tasks.

Re: a question on number of parallel tasks

2008-01-16 Thread Ted Dunning
The part nomenclature does not refer to splits. It refers to how many reduce processes were involved in actually writing the output: you will get more part files if you have more reducers. Files are split at read-time as necessary.
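To make the naming concrete: with N reduce tasks, the job's output directory holds one part file per reducer, part-00000 through part-0000(N-1). A hedged sketch that lists them via the FileSystem API (the path is hypothetical, and on very old releases the listing call may differ):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListParts {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // With four reducers, this would print part-00000 .. part-00003,
        // one file per reduce task that wrote output.
        FileStatus[] stats = fs.listStatus(new Path("/user/bear/output"));
        if (stats != null) {
          for (FileStatus stat : stats) {
            System.out.println(stat.getPath().getName());
          }
        }
      }
    }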

Re: a question on number of parallel tasks

2008-01-16 Thread Jim the Standing Bear
Thanks Ted. I just didn't ask it right. Here is a stupid 101 question, whose answer I am sure lies in the documentation somewhere; I was just having some difficulty finding it... when I do an "ls" on the dfs, I would see this:

/user/bear/output/part-0

I probably got confused about what the "part" number refers to.

Re: a question on number of parallel tasks

2008-01-16 Thread Ted Dunning
Parallelizing the processing of data occurs at two stages. The first is during the map phase, where the input data file is (hopefully) split across multiple tasks. This should happen transparently most of the time, unless you have a perverse data format or use unsplittable compression on your files. The second is during the reduce phase, where the degree of parallelism is simply the number of reduce tasks you configure.
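As an illustration of the unsplittable case: an input format can veto splitting, so each input file becomes exactly one split and hence one map task. A sketch against the old mapred API; the class name is invented, and this mirrors the effect that unsplittable compression such as gzip has:

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.TextInputFormat;

    // Invented example: a text input format that refuses to split files.
    public class WholeFileTextInputFormat extends TextInputFormat {
      @Override
      protected boolean isSplitable(FileSystem fs, Path file) {
        return false; // one split per file => one map task per file
      }
    }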