[ https://issues.apache.org/jira/browse/PIG-2829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jie Li updated PIG-2829: ------------------------ Attachment: 2829.1.patch Attached patch with all the rework (mostly learnt from hive): 1: separates options to enable combiner and mapagg 2: changes existing default values: ||property||old default value||new default value||comment|| |pig.exec.nocombiner|false|true| disable combiner by default| |pig.exec.mapPartAgg|false|true| enable mapagg by default| |pig.exec.mapPartAgg.minReduction|10|2.0| more aggressive. also change from int to double| 3: adds a property pig.exec.mapPartAgg.reduction.checkinterval which defaults to 100k, so after processing every 100k records mapagg will check the reduction rate to see if it should be disabled. Previously we only look at first 1000 records. 4: previously the reduction check would also happen if the hash map gets full. The patch removes this condition and instead it keeps track of the total new hash map entries, so the reduction check will only be triggered by pig.exec.mapPartAgg.reduction.checkinterval, which is easier to control. Welcome to give any comment! Will work on fixing unit tests and performance testing. > Use partial aggregation more aggresively > ---------------------------------------- > > Key: PIG-2829 > URL: https://issues.apache.org/jira/browse/PIG-2829 > Project: Pig > Issue Type: Improvement > Affects Versions: 0.10.0 > Reporter: Jie Li > Attachments: 2829.1.patch, 2829.separate.options.patch, > pigmix-10G.png, tpch-10G.png > > > Partial aggregation (Hash Aggregation, aka in-map combiner) is a new feature > in Pig 0.10 that will perform aggregation within map function. The main > advantage against combiner is it avoids de/serializing and sorting the data, > and it can auto disable itself if the data reduction rate is low. Currently > it's disabled by default. > To leverage the power of PartialAgg more aggressively, several things need to > be revisited: > 1. The threshold of auto-disabling. Currently each mapper looks at first 1k > (hard-coded) records to see if there's enough data size reduction (defaults > to 10x, configurable). The check would happen earlier if the hash table gets > full before processing the 1k records (hash table size is controlled by > pig.cachedbag.memusage). We might want to relax these thresholds. > 2. Dependency on the combiner. Currently the PartialAgg won't work without a > combiner following it, so we need to provide separate options to enable each > independently. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira