[
https://issues.apache.org/jira/browse/PIG-1081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Olga Natkovich resolved PIG-1081.
---------------------------------
Resolution: Duplicate
Duplicate of PIG-1084
> PigCookBook use of PARALLEL keyword
> -----------------------------------
>
> Key: PIG-1081
> URL: https://issues.apache.org/jira/browse/PIG-1081
> Project: Pig
> Issue Type: Bug
> Components: documentation
> Affects Versions: 0.5.0
> Reporter: Viraj Bhat
> Fix For: 0.5.0
>
>
> Hi all,
> I am looking at some tips for optimizing Pig programs (Pig Cookbook) using
> the PARALLEL keyword.
> http://hadoop.apache.org/pig/docs/r0.5.0/cookbook.html#Use+PARALLEL+Keyword
> We know that Pig 0.5 currently uses Hadoop 0.20 as its default, which launches
> 1 reducer in all cases.
> In this documentation we state: <num machines> * <num reduce slots per
> machine> * 0.9. This guidance was valid for HoD (Hadoop on Demand), where you
> create your own Hadoop clusters, but if you are using either the Capacity
> Scheduler
> http://hadoop.apache.org/common/docs/current/capacity_scheduler.html or the
> Fair Share Scheduler
> http://hadoop.apache.org/common/docs/current/fair_scheduler.html , this
> number could mean that you are using around 90% of the reducer slots on
> your cluster.
> We should change this to something like:
> The number of reducers you need for a particular construct in Pig that forms
> a Map Reduce boundary depends entirely on your data and the number of
> intermediate keys your mappers generate. In the best cases we have seen, a
> reducer processing about 500 MB of data behaves efficiently. It is also hard
> to define the optimum number of reducers, since it depends completely on the
> partitioner and the distribution of map (combiner) output keys.
> Viraj
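For context, a minimal Pig Latin sketch of the PARALLEL keyword the cookbook entry discusses: PARALLEL sets the reducer count for the operator that forms a Map Reduce boundary (file, relation, and field names here are hypothetical, and 10 is an arbitrary reducer count for illustration):

```
-- Hypothetical sketch: PARALLEL requests 10 reducers for the GROUP,
-- which is the Map Reduce boundary in this script.
A = LOAD 'input/data' AS (key:chararray, value:int);
B = GROUP A BY key PARALLEL 10;
C = FOREACH B GENERATE group, SUM(A.value);
STORE C INTO 'output/sums';
```

Without PARALLEL, the default number of reducers applies (one, on the Hadoop 0.20 default discussed above), regardless of data size.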
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.