[ https://issues.apache.org/jira/browse/PIG-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13098190#comment-13098190 ]
Thejas M Nair commented on PIG-2228: ------------------------------------ bq. Do you have any benchmark results on this yet? I am running some performance tests, I will publish numbers in a day or two. I will do 'submit patch' after that. bq. It'd be great if you could post this on ReviewBoard, easier to comment that way. Will do that once I am done with the performance tests. > support partial aggregation in map task > --------------------------------------- > > Key: PIG-2228 > URL: https://issues.apache.org/jira/browse/PIG-2228 > Project: Pig > Issue Type: Bug > Reporter: Thejas M Nair > Assignee: Thejas M Nair > Fix For: 0.10 > > Attachments: PIG-2228.1.patch, PIG-2228.2.patch, PIG-2228.3.patch > > > h3. Introduction > Pig does (sort based) partial aggregation in map side through the use of > combiner. MR serializes the output of map to a buffer, sorts it on the keys, > deserializes and passes the values grouped on the keys to combiner phase. The > same work of combiner can be done in the map phase itself by using a hash-map > on the keys. This hash based (partial) aggregation can be done with or > without a combiner phase. > h3. Benefits > It will send fewer records to combiner and thereby - > * Save on cost of serializing and de-serializing > * Save on cost of lock calls on the combiner input buffer. (I have found > this to be a significant cost for a query that was doing multiple group-by's > in a single MR job. -Thejas) > * The problem of running out of memory in reduce side, for queries like > COUNT(distinct col) can be avoided. The OOM issue happens because very large > records get created after the combiner run on merged reduce input. In case of > combiner, you have no way of telling MR not to combine records in reduce > side. The workaround is to disable combiner completely, and the opportunity > to reduce map output size is lost. > * When the foreach after group-by has both algebraic and non-algebraic > functions, or if a bag is being projected, the combiner is not used. This is > because the data size reduction in typical cases are not significant enough > to justify the additional (de)serialization costs. But hash based aggregation > can be used in such cases as well. > * It is possible to turn off the in-map combine automatically if there is > not enough 'combination' that is taking place to justify the overhead of the > in-map combiner. (Idea borrowed from Hive jira.) > * If input data is sorted, it is possible to do efficient map side > (partial) aggregation with in-map combiner. > Design proposal is here - > https://cwiki.apache.org/confluence/display/PIG/PigInMapCombinerProposal -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira