Aniket: No, I am not using HCatalog. To follow up on this thread, I was indeed able to run multiple reduce tasks using the PARALLEL clause. Thanks everyone for helping out. Unfortunately, I ran into an out-of-memory error after that and I'm debugging it now (I created a separate thread for advice).
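For anyone finding this thread later, a minimal sketch of the PARALLEL clause being discussed (the file paths, aliases, and field names here are made up for illustration, not from the original job):

```pig
-- Hypothetical dataset; path and schema are illustrative assumptions.
logs = LOAD '/data/clicks' AS (user:chararray, bytes:long);

-- PARALLEL sets the number of reduce tasks for this one reduce-side
-- operator (here, 40 reducers for the GROUP).
grouped = GROUP logs BY user PARALLEL 40;

sums = FOREACH grouped GENERATE group, SUM(logs.bytes);
STORE sums INTO '/data/clicks_by_user';
```

PARALLEL only affects the statement it is attached to; other reduce-side operators in the same script keep the default unless they carry their own PARALLEL clause.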
Thanks,
Pankaj

On Jun 18, 2012, at 1:29 AM, Aniket Mokashi wrote:

> Pankaj, are you using hcatalog?
>
> On Fri, Jun 1, 2012 at 5:24 PM, Prashant Kommireddi <[email protected]> wrote:
>
>> Right. And the documentation provides a list of operations that can be
>> parallelized.
>>
>> On Jun 1, 2012, at 4:50 PM, Dmitriy Ryaboy <[email protected]> wrote:
>>
>>> That being said, some operators, such as "group all" and limit, do
>>> require using only 1 reducer, by nature. So it depends on what your
>>> script is doing.
>>>
>>> On Jun 1, 2012, at 12:26 PM, Prashant Kommireddi <[email protected]> wrote:
>>>
>>>> The automatic heuristic works the same in 0.9.1
>>>> (http://pig.apache.org/docs/r0.9.1/perf.html#parallel), but you might
>>>> be better off setting it manually after looking at the job tracker
>>>> counters.
>>>>
>>>> You should be fine using PARALLEL with any of the operators mentioned
>>>> in the doc.
>>>>
>>>> -Prashant
>>>>
>>>> On Fri, Jun 1, 2012 at 12:19 PM, Pankaj Gupta <[email protected]> wrote:
>>>>
>>>>> Hi Prashant,
>>>>>
>>>>> Thanks for the tips. We haven't moved to Pig 0.10.0 yet, but it seems
>>>>> like a very useful upgrade. For the moment, though, it seems that I
>>>>> should be able to use the 1 GB per reducer heuristic and specify the
>>>>> number of reducers in Pig 0.9.1 by using the PARALLEL clause in the
>>>>> Pig script. Does this sound right?
>>>>>
>>>>> Thanks,
>>>>> Pankaj
>>>>>
>>>>> On Jun 1, 2012, at 12:03 PM, Prashant Kommireddi wrote:
>>>>>
>>>>>> Also, please note that the default number of reducers is based on
>>>>>> the input dataset. In the basic case, Pig will "automatically" spawn
>>>>>> one reducer for each GB of input, so if your input dataset size is
>>>>>> 500 GB you should see 500 reducers being spawned (though this is
>>>>>> excessive in a lot of cases).
>>>>>>
>>>>>> This document talks about parallelism:
>>>>>> http://pig.apache.org/docs/r0.10.0/perf.html#parallel
>>>>>>
>>>>>> Setting the right number of reducers (PARALLEL or set
>>>>>> default_parallel) depends on what you are doing. If the reducer is
>>>>>> CPU intensive (maybe a complex UDF running on the reduce side), you
>>>>>> would probably spawn more reducers. Otherwise (in most cases), the
>>>>>> suggestion in the doc (1 GB per reducer) holds good for regular
>>>>>> aggregations (SUM, COUNT, ...).
>>>>>>
>>>>>> 1. Take a look at Reduce Shuffle Bytes for the job on the JobTracker.
>>>>>> 2. Re-run the job, setting default_parallel to 1 reducer per 1 GB of
>>>>>>    reduce shuffle bytes, and see if it performs well.
>>>>>> 3. If not, adjust it according to your reducer heap size. The more
>>>>>>    heap, the less data is spilled to disk.
>>>>>>
>>>>>> There are a few more properties on the reduce side (buffer size,
>>>>>> etc.), but those probably aren't required to start with.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Prashant
>>>>>>
>>>>>> On Fri, Jun 1, 2012 at 11:49 AM, Jonathan Coveney <[email protected]> wrote:
>>>>>>
>>>>>>> Pankaj,
>>>>>>>
>>>>>>> What version of Pig are you using? Later versions of Pig have some
>>>>>>> logic for automatically setting parallelism (though sometimes these
>>>>>>> heuristics will be wrong).
>>>>>>>
>>>>>>> There are also some operations which will force you to use 1
>>>>>>> reducer. It depends on what your script is doing.
>>>>>>>
>>>>>>> 2012/6/1 Pankaj Gupta <[email protected]>
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I just realized that one of my large-scale Pig jobs that has 100K
>>>>>>>> map tasks actually has only one reduce task. Reading the
>>>>>>>> documentation, I see that the number of reduce tasks is defined by
>>>>>>>> the PARALLEL clause, whose default value is 1. I have a few
>>>>>>>> questions around this:
>>>>>>>>
>>>>>>>> # Why is the default value of reduce tasks 1?
>>>>>>>> # (Related to the first question) Why aren't reduce tasks
>>>>>>>>   parallelized automatically in Pig?
>>>>>>>> # How do I choose a good value of reduce tasks for my Pig jobs?
>>>>>>>>
>>>>>>>> Thanks in advance,
>>>>>>>> Pankaj
>
> --
> "...:::Aniket:::... Quetzalco@tl"
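The sizing heuristic Prashant describes above (1 reducer per GB of reduce shuffle bytes, adjusted for reducer heap) can be sketched like this; the shuffle-bytes figure, path, and schema are assumptions for illustration only:

```pig
-- Suppose the JobTracker counters showed ~120 GB of Reduce Shuffle Bytes
-- for this job. At 1 reducer per 1 GB of shuffle data, that suggests 120.
SET default_parallel 120;

-- Hypothetical dataset; path and schema are illustrative assumptions.
logs = LOAD '/data/clicks' AS (user:chararray, bytes:long);

-- Reduce-side operators (GROUP, JOIN, ORDER, DISTINCT, ...) now get 120
-- reducers unless a statement overrides it with an explicit PARALLEL clause.
grouped = GROUP logs BY user;

sums = FOREACH grouped GENERATE group, COUNT(logs), SUM(logs.bytes);
STORE sums INTO '/data/clicks_by_user';
```

If the job still spills heavily, the thread's advice is to trade off reducer count against reducer heap: more heap per reducer means less data spilled to disk, so fewer reducers may suffice.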
