[
https://issues.apache.org/jira/browse/PIG-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Thejas M Nair updated PIG-2228:
-------------------------------
Attachment: PIG-2228.5.patch
PIG-2228.5.patch - added description of the two new properties introduced in
this jira, in pig usage (output of '-h properties')
> support partial aggregation in map task
> ---------------------------------------
>
> Key: PIG-2228
> URL: https://issues.apache.org/jira/browse/PIG-2228
> Project: Pig
> Issue Type: Bug
> Reporter: Thejas M Nair
> Assignee: Thejas M Nair
> Fix For: 0.10
>
> Attachments: PIG-2228.1.patch, PIG-2228.2.patch, PIG-2228.3.patch,
> PIG-2228.4.patch, PIG-2228.5.patch
>
>
> h3. Introduction
> Pig does (sort based) partial aggregation in map side through the use of
> combiner. MR serializes the output of map to a buffer, sorts it on the keys,
> deserializes and passes the values grouped on the keys to combiner phase. The
> same work of combiner can be done in the map phase itself by using a hash-map
> on the keys. This hash based (partial) aggregation can be done with or
> without a combiner phase.
> h3. Benefits
> It will send fewer records to combiner and thereby -
> * Save on cost of serializing and de-serializing
> * Save on cost of lock calls on the combiner input buffer. (I have found
> this to be a significant cost for a query that was doing multiple group-by's
> in a single MR job. -Thejas)
> * The problem of running out of memory in reduce side, for queries like
> COUNT(distinct col) can be avoided. The OOM issue happens because very large
> records get created after the combiner run on merged reduce input. In case of
> combiner, you have no way of telling MR not to combine records in reduce
> side. The workaround is to disable combiner completely, and the opportunity
> to reduce map output size is lost.
> * When the foreach after group-by has both algebraic and non-algebraic
> functions, or if a bag is being projected, the combiner is not used. This is
> because the data size reduction in typical cases are not significant enough
> to justify the additional (de)serialization costs. But hash based aggregation
> can be used in such cases as well.
> * It is possible to turn off the in-map combine automatically if there is
> not enough 'combination' that is taking place to justify the overhead of the
> in-map combiner. (Idea borrowed from Hive jira.)
> * If input data is sorted, it is possible to do efficient map side
> (partial) aggregation with in-map combiner.
> Design proposal is here -
> https://cwiki.apache.org/confluence/display/PIG/PigInMapCombinerProposal
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira