Aniket: No, I am not using HCatalog. To follow up on this thread, I was indeed able to run multiple reduce tasks using the PARALLEL clause. Thanks everyone for helping out. Unfortunately, I ran into an out-of-memory error after that and I'm debugging it now (I created a separate thread for advice).
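For anyone finding this thread later, a minimal sketch of the PARALLEL clause being discussed (the file paths, aliases, and field names here are made up for illustration, not from the original job):

```pig
-- Hypothetical dataset; path and schema are illustrative assumptions.
logs = LOAD '/data/clicks' AS (user:chararray, bytes:long);

-- PARALLEL sets the number of reduce tasks for this one reduce-side
-- operator (here, 40 reducers for the GROUP).
grouped = GROUP logs BY user PARALLEL 40;

sums = FOREACH grouped GENERATE group, SUM(logs.bytes);
STORE sums INTO '/data/clicks_by_user';
```

PARALLEL only affects the statement it is attached to; other reduce-side operators in the same script keep the default unless they carry their own PARALLEL clause.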
Thanks,
Pankaj

On Jun 18, 2012, at 1:29 AM, Aniket Mokashi wrote:

> Pankaj, are you using hcatalog?
>
> On Fri, Jun 1, 2012 at 5:24 PM, Prashant Kommireddi <[email protected]> wrote:
>
>> Right. And the documentation provides a list of operations that can be
>> parallelized.
>>
>> On Jun 1, 2012, at 4:50 PM, Dmitriy Ryaboy <[email protected]> wrote:
>>
>>> That being said, some operators, such as "group all" and limit, do
>>> require using only 1 reducer, by nature. So it depends on what your
>>> script is doing.
>>>
>>> On Jun 1, 2012, at 12:26 PM, Prashant Kommireddi <[email protected]> wrote:
>>>
>>>> The automatic heuristic works the same in 0.9.1
>>>> (http://pig.apache.org/docs/r0.9.1/perf.html#parallel), but you might
>>>> be better off setting it manually after looking at the job tracker
>>>> counters.
>>>>
>>>> You should be fine using PARALLEL with any of the operators mentioned
>>>> in the doc.
>>>>
>>>> -Prashant
>>>>
>>>> On Fri, Jun 1, 2012 at 12:19 PM, Pankaj Gupta <[email protected]> wrote:
>>>>
>>>>> Hi Prashant,
>>>>>
>>>>> Thanks for the tips. We haven't moved to Pig 0.10.0 yet, but it seems
>>>>> like a very useful upgrade. For the moment, though, it seems that I
>>>>> should be able to use the 1 GB per reducer heuristic and specify the
>>>>> number of reducers in Pig 0.9.1 by using the PARALLEL clause in the
>>>>> Pig script. Does this sound right?
>>>>>
>>>>> Thanks,
>>>>> Pankaj
>>>>>
>>>>> On Jun 1, 2012, at 12:03 PM, Prashant Kommireddi wrote:
>>>>>
>>>>>> Also, please note that the default number of reducers is based on
>>>>>> the input dataset. In the basic case, Pig will "automatically" spawn
>>>>>> one reducer for each GB of input, so if your input dataset size is
>>>>>> 500 GB you should see 500 reducers being spawned (though this is
>>>>>> excessive in a lot of cases).
>>>>>>
>>>>>> This document talks about parallelism:
>>>>>> http://pig.apache.org/docs/r0.10.0/perf.html#parallel
>>>>>>
>>>>>> Setting the right number of reducers (PARALLEL or set
>>>>>> default_parallel) depends on what you are doing. If the reducer is
>>>>>> CPU intensive (maybe a complex UDF running on the reduce side), you
>>>>>> would probably spawn more reducers. Otherwise (in most cases), the
>>>>>> suggestion in the doc (1 GB per reducer) holds good for regular
>>>>>> aggregations (SUM, COUNT, ...).
>>>>>>
>>>>>> 1. Take a look at Reduce Shuffle Bytes for the job on the JobTracker.
>>>>>> 2. Re-run the job, setting default_parallel to 1 reducer per 1 GB of
>>>>>>    reduce shuffle bytes, and see if it performs well.
>>>>>> 3. If not, adjust it according to your reducer heap size. The more
>>>>>>    heap, the less data is spilled to disk.
>>>>>>
>>>>>> There are a few more properties on the reduce side (buffer size,
>>>>>> etc.), but those probably aren't required to start with.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Prashant
>>>>>>
>>>>>> On Fri, Jun 1, 2012 at 11:49 AM, Jonathan Coveney <[email protected]> wrote:
>>>>>>
>>>>>>> Pankaj,
>>>>>>>
>>>>>>> What version of Pig are you using? Later versions of Pig have some
>>>>>>> logic for automatically setting parallelism (though sometimes these
>>>>>>> heuristics will be wrong).
>>>>>>>
>>>>>>> There are also some operations which will force you to use 1
>>>>>>> reducer. It depends on what your script is doing.
>>>>>>>
>>>>>>> 2012/6/1 Pankaj Gupta <[email protected]>
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I just realized that one of my large-scale Pig jobs that has 100K
>>>>>>>> map tasks actually has only one reduce task. Reading the
>>>>>>>> documentation, I see that the number of reduce tasks is defined by
>>>>>>>> the PARALLEL clause, whose default value is 1. I have a few
>>>>>>>> questions around this:
>>>>>>>>
>>>>>>>> # Why is the default value of reduce tasks 1?
>>>>>>>> # (Related to the first question) Why aren't reduce tasks
>>>>>>>>   parallelized automatically in Pig?
>>>>>>>> # How do I choose a good value of reduce tasks for my Pig jobs?
>>>>>>>>
>>>>>>>> Thanks in advance,
>>>>>>>> Pankaj
>
> --
> "...:::Aniket:::... Quetzalco@tl"
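The sizing heuristic Prashant describes above (1 reducer per GB of reduce shuffle bytes, adjusted for reducer heap) can be sketched like this; the shuffle-bytes figure, path, and schema are assumptions for illustration only:

```pig
-- Suppose the JobTracker counters showed ~120 GB of Reduce Shuffle Bytes
-- for this job. At 1 reducer per 1 GB of shuffle data, that suggests 120.
SET default_parallel 120;

-- Hypothetical dataset; path and schema are illustrative assumptions.
logs = LOAD '/data/clicks' AS (user:chararray, bytes:long);

-- Reduce-side operators (GROUP, JOIN, ORDER, DISTINCT, ...) now get 120
-- reducers unless a statement overrides it with an explicit PARALLEL clause.
grouped = GROUP logs BY user;

sums = FOREACH grouped GENERATE group, COUNT(logs), SUM(logs.bytes);
STORE sums INTO '/data/clicks_by_user';
```

If the job still spills heavily, the thread's advice is to trade off reducer count against reducer heap: more heap per reducer means less data spilled to disk, so fewer reducers may suffice.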
