[ https://issues.apache.org/jira/browse/PIG-1081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Olga Natkovich resolved PIG-1081.
---------------------------------
    Resolution: Duplicate

Duplicate of PIG-1084

> PigCookBook use of PARALLEL keyword
> -----------------------------------
>
>                 Key: PIG-1081
>                 URL: https://issues.apache.org/jira/browse/PIG-1081
>             Project: Pig
>          Issue Type: Bug
>          Components: documentation
>    Affects Versions: 0.5.0
>            Reporter: Viraj Bhat
>             Fix For: 0.5.0
>
> Hi all,
> I am looking at some tips for optimizing Pig programs (Pig Cookbook) using the PARALLEL keyword:
> http://hadoop.apache.org/pig/docs/r0.5.0/cookbook.html#Use+PARALLEL+Keyword
> We know that Pig 0.5 currently uses Hadoop 0.20 (as its default), which launches one reducer in all cases.
> In this documentation we state: <num machines> * <num reduce slots per machine> * 0.9. This guidance was valid for HoD (Hadoop on Demand), where you create your own Hadoop clusters. But if you are using either the Capacity Scheduler (http://hadoop.apache.org/common/docs/current/capacity_scheduler.html) or the Fair Share Scheduler (http://hadoop.apache.org/common/docs/current/fair_scheduler.html), these numbers could mean that you are using around 90% of the reducer slots on your cluster.
> We should change this to something like:
> The number of reducers you may need for a particular construct in Pig that forms a MapReduce boundary depends entirely on your data and the number of intermediate keys you are generating in your mappers. In the best cases we have seen that a reducer processing about 500 MB of data behaves efficiently. Additionally, it is hard to define the optimum number of reducers, since it depends completely on the partitioner and the distribution of map (combiner) output keys.
> Viraj

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
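
The proposed guidance above can be illustrated with a short Pig Latin sketch. The file names, schema, and data size here are hypothetical; the reducer count of 10 simply applies the ~500 MB-per-reducer rule of thumb to an assumed ~5 GB of intermediate data at the GROUP boundary:

```pig
-- Hypothetical input; assume roughly 5 GB of intermediate data at the GROUP
-- (MapReduce) boundary, so ~10 reducers at ~500 MB each is a reasonable start.
logs = LOAD 'access_log' AS (user:chararray, url:chararray);

-- PARALLEL sets the number of reducers for this MapReduce boundary only.
grpd = GROUP logs BY user PARALLEL 10;

cnts = FOREACH grpd GENERATE group, COUNT(logs);
STORE cnts INTO 'user_counts';
```

Note that PARALLEL only affects reduce-side operators (GROUP, JOIN, ORDER, and so on); the actual best value still depends on the partitioner and the key distribution, as the issue text points out.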