support partial aggregation in map task
---------------------------------------
Key: PIG-2228
URL: https://issues.apache.org/jira/browse/PIG-2228
Project: Pig
Issue Type: Bug
Reporter: Thejas M Nair
Assignee: Thejas M Nair
Fix For: 0.10
h3. Introduction
Pig does (sort based) partial aggregation in map side through the use of
combiner. MR serializes the output of map to a buffer, sorts it on the keys,
deserializes and passes the values grouped on the keys to combiner phase. The
same work of combiner can be done in the map phase itself by using a hash-map
on the keys. This hash based (partial) aggregation can be done with or without
a combiner phase.
h3. Benefits
It will send fewer records to combiner and thereby -
* Save on cost of serializing and de-serializing
* Save on cost of lock calls on the combiner input buffer. (I have found this
to be a significant cost for a query that was doing multiple group-by's in a
single MR job. -Thejas)
* The problem of running out of memory in reduce side, for queries like
COUNT(distinct col) can be avoided. The OOM issue happens because very large
records get created after the combiner run on merged reduce input. In case of
combiner, you have no way of telling MR not to combine records in reduce side.
The workaround is to disable combiner completely, and the opportunity to reduce
map output size is lost.
* When the foreach after group-by has both algebraic and non-algebraic
functions, or if a bag is being projected, the combiner is not used. This is
because the data size reduction in typical cases are not significant enough to
justify the additional (de)serialization costs. But hash based aggregation can
be used in such cases as well.
* It is possible to turn off the in-map combine automatically if there is not
enough 'combination' that is taking place to justify the overhead of the in-map
combiner. (Idea borrowed from Hive jira.)
* If input data is sorted, it is possible to do efficient map side (partial)
aggregation with in-map combiner.
Design proposal is here -
https://cwiki.apache.org/confluence/display/PIG/PigInMapCombinerProposal
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira