[ https://issues.apache.org/jira/browse/PIG-2829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13424140#comment-13424140 ]
Thejas M Nair commented on PIG-2829:
------------------------------------

Thanks for the benchmark, Jie. Clearly, partial-agg is working better than the combiner.
Can you also run some benchmarks with the combiner turned off, so that we can verify the appropriate value for pig.exec.mapPartAgg.minReduction?

||query || combiner off, partial-agg off || combiner off, partial-agg on ||
|g-by with reduction by 3 | | |
|g-by with reduction by 2 | | |

> Use partial aggregation more aggresively
> ----------------------------------------
>
>                 Key: PIG-2829
>                 URL: https://issues.apache.org/jira/browse/PIG-2829
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.10.0
>            Reporter: Jie Li
>         Attachments: 2829.1.patch, 2829.2.patch, 2829.separate.options.patch, pigmix-10G.png, tpch-10G.png
>
>
> Partial aggregation (Hash Aggregation, aka in-map combiner) is a new feature
> in Pig 0.10 that performs aggregation within the map function. Its main
> advantage over the combiner is that it avoids de/serializing and sorting the
> data, and it can automatically disable itself if the data reduction rate is
> low. Currently it's disabled by default.
> To leverage the power of PartialAgg more aggressively, several things need to
> be revisited:
> 1. The threshold of auto-disabling. Currently each mapper looks at the first 1k
> (hard-coded) records to see if there's enough data size reduction (defaults
> to 10x, configurable). The check happens earlier if the hash table gets
> full before the 1k records have been processed (hash table size is controlled
> by pig.cachedbag.memusage). We might want to relax these thresholds.
> 2. Dependency on the combiner. Currently PartialAgg won't work without a
> combiner following it, so we need to provide separate options to enable each
> independently.

--
This message is automatically generated by JIRA.
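To make the auto-disable check from point 1 of the issue description concrete, here is a minimal Python sketch. It is not Pig's implementation: the function name `partial_agg_should_stay_on` and its parameters are hypothetical, and it uses the ratio of input records to distinct keys as a stand-in for the data size reduction that Pig measures. It only illustrates the shape of the decision: sample up to the first 1k records, stop earlier if the hash table fills up, then compare the observed reduction with the minimum required.

```python
from collections import defaultdict


def partial_agg_should_stay_on(records, min_reduction=10.0,
                               check_after=1000, max_table_entries=None):
    """Decide whether in-map aggregation looks worthwhile for one mapper.

    records           -- iterable of (key, value) pairs seen by the mapper
    min_reduction     -- required reduction factor (assumed analogue of
                         pig.exec.mapPartAgg.minReduction)
    check_after       -- number of records to sample before deciding
    max_table_entries -- hypothetical cap standing in for a full hash table
    """
    table = defaultdict(int)
    seen = 0
    for key, value in records:
        table[key] += value  # aggregate in the map-side hash table
        seen += 1
        table_full = (max_table_entries is not None
                      and len(table) >= max_table_entries)
        # Decide after the sample, or earlier if the table is already full.
        if seen >= check_after or table_full:
            break
    if seen == 0:
        return True  # nothing sampled, so there is nothing to disable
    # Record-count ratio is an assumption; Pig looks at data size reduction.
    return seen / len(table) >= min_reduction
```

With this sketch, a group-by that only reduces the input by 2x or 3x is switched off under the default threshold of 10 and stays on once the threshold is lowered to 2 or 3, which is the effect the benchmark table above is meant to quantify.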