The automatic heuristic works the same way in 0.9.1 (http://pig.apache.org/docs/r0.9.1/perf.html#parallel), but you might be better off setting the number of reducers manually after looking at the JobTracker counters.
You should be fine with using PARALLEL for any of the operators mentioned on the doc.

-Prashant

On Fri, Jun 1, 2012 at 12:19 PM, Pankaj Gupta <[email protected]> wrote:

> Hi Prashant,
>
> Thanks for the tips. We haven't moved to Pig 0.10.0 yet, but seems like a
> very useful upgrade. For the moment though it seems that I should be able
> to use the 1GB per reducer heuristic and specify the number of reducers in
> Pig 0.9.1 by using the PARALLEL clause in the Pig script. Does this sound
> right?
>
> Thanks,
> Pankaj
>
>
> On Jun 1, 2012, at 12:03 PM, Prashant Kommireddi wrote:
>
> > Also, please note default number of reducers are based on input dataset. In
> > the basic case, Pig will "automatically" spawn a reducer for each GB of
> > input, so if your input dataset size is 500 GB you should see 500 reducers
> > being spawned (though this is excessive in a lot of cases).
> >
> > This document talks about parallelism
> > http://pig.apache.org/docs/r0.10.0/perf.html#parallel
> >
> > Setting the right number of reducers (PARALLEL or set default_parallel)
> > depends on what you are doing with it. If the reducer is CPU intensive (may
> > be a complex UDF running on reducer side), you would probably spawn more
> > reducers. Otherwise (in most cases), the suggestion in the doc (1 GB per
> > reducer) holds good for regular aggregations (SUM, COUNT..).
> >
> >
> > 1. Take a look at Reduce Shuffle Bytes for the job on JobTracker
> > 2. Re-run the job by setting default_parallel to -> 1 reducer per 1 GB
> >    of reduce shuffle bytes and see if it performs well
> > 3. If not, adjust it according to your Reducer heap size. More the heap,
> >    less is the data spilled to disk.
> >
> > There are a few more properties on the Reduce side (buffer size etc) but
> > that probably is not required to start with.
> >
> > Thanks,
> >
> > Prashant
> >
> >
> > On Fri, Jun 1, 2012 at 11:49 AM, Jonathan Coveney <[email protected]> wrote:
> >
> >> Pankaj,
> >>
> >> What version of pig are you using? In later versions of pig, it should have
> >> some logic around automatically setting parallelisms (though sometimes
> >> these heuristics will be wrong).
> >>
> >> There are also some operations which will force you to use 1 reducer. It
> >> depends on what your script is doing.
> >>
> >> 2012/6/1 Pankaj Gupta <[email protected]>
> >>
> >>> Hi,
> >>>
> >>> I just realized that one of my large scale pig jobs that has 100K map jobs
> >>> actually only has one reduce task. Reading the documentation I see that the
> >>> number of reduce tasks is defined by the PARALLEL clause whose default
> >>> value is 1. I have a few questions around this:
> >>>
> >>> # Why is the default value of reduce tasks 1?
> >>> # (Related to first question) Why aren't reduce tasks parallelized
> >>>   automatically in Pig?
> >>> # How do I choose a good value of reduce tasks for my pig jobs?
> >>>
> >>> Thanks in Advance,
> >>> Pankaj
> >>
> >
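
The sizing steps discussed in this thread (read Reduce Shuffle Bytes from the JobTracker, start from one reducer per 1 GB, then adjust) can be sketched as a small calculation. This is only an illustrative sketch and not part of Pig: the function names, the decimal-GB constant, and the cap of 999 reducers are assumptions made for the example, and the result is a starting point to tune, not a definitive value.

```python
# Assumption: "1 GB" is taken as 10**9 bytes for this sketch.
BYTES_PER_GB = 1_000_000_000


def estimate_reducers(shuffle_bytes, bytes_per_reducer=BYTES_PER_GB, max_reducers=999):
    """Suggest a starting reducer count: one reducer per `bytes_per_reducer`.

    `shuffle_bytes` is the Reduce Shuffle Bytes counter read off the JobTracker.
    `max_reducers` is a hypothetical safety cap chosen for this example.
    """
    if shuffle_bytes < 0:
        raise ValueError("shuffle_bytes must be non-negative")
    if bytes_per_reducer <= 0 or max_reducers <= 0:
        raise ValueError("bytes_per_reducer and max_reducers must be positive")
    # Integer ceiling division, so a partial GB still gets its own reducer.
    needed = -(-shuffle_bytes // bytes_per_reducer)
    # Always use at least one reducer, and never exceed the cap.
    return max(1, min(max_reducers, needed))


def default_parallel_statement(reducers):
    """Format the Pig statement that applies the count script-wide."""
    return f"SET default_parallel {reducers};"


if __name__ == "__main__":
    # 500 GB of reduce shuffle bytes, as in the example from the thread.
    n = estimate_reducers(500 * BYTES_PER_GB)
    print(default_parallel_statement(n))
```

If the reducers are CPU intensive or spill heavily, lowering `bytes_per_reducer` in this sketch yields more reducers, which mirrors the "adjust it according to your Reducer heap size" step.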
