Jie Li created PIG-2829:
---------------------------

             Summary: Use partial aggregation more aggressively
                 Key: PIG-2829
                 URL: https://issues.apache.org/jira/browse/PIG-2829
             Project: Pig
          Issue Type: Improvement
    Affects Versions: 0.10.0
            Reporter: Jie Li


Partial aggregation (hash aggregation, a.k.a. in-map combiner) is a new feature in
Pig 0.10 that performs aggregation within the map function. Its main advantage
over the combiner is that it avoids serializing, deserializing, and sorting the
data, and it can automatically disable itself if the data reduction rate is low.
It is currently disabled by default.

To leverage the power of PartialAgg more aggressively, several things need to 
be revisited:

1. The threshold for auto-disabling. Currently each mapper looks at the first 1k
(hard-coded) records to check whether the data size is reduced enough (the
required reduction defaults to 10x and is configurable). The check happens
earlier if the hash table fills up before 1k records have been processed (the
hash table size is controlled by pig.cachedbag.memusage). We might want to relax
these thresholds.

2. Dependency on the combiner. Currently PartialAgg won't work without a
combiner following it, so we need to provide separate options to enable each
independently.
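
To make the auto-disabling check in item 1 concrete, here is a rough sketch in
Python. It is not Pig's implementation: the class name, its parameters, and the
use of a record-count ratio in place of a data-size ratio are all assumptions
made for illustration, and max_entries is only a stand-in for the memory limit
that pig.cachedbag.memusage controls.

```python
class PartialAggSketch:
    """Toy in-map hash aggregation (sum per key) with an auto-disable check.

    Hypothetical illustration only; names and defaults are assumptions.
    """

    def __init__(self, check_after=1000, min_reduction=10.0, max_entries=10_000):
        self.check_after = check_after      # records to sample before deciding
        self.min_reduction = min_reduction  # required input/output ratio
        self.max_entries = max_entries      # stand-in for the memory limit
        self.table = {}                     # key -> partial sum
        self.seen = 0                       # records aggregated so far
        self.checked = False                # has the one-time check run?
        self.enabled = True                 # False once aggregation is disabled

    def _flush(self):
        """Return the partial aggregates and empty the hash table."""
        out = list(self.table.items())
        self.table.clear()
        return out

    def _check(self):
        """One-time decision: keep aggregating only if reduction is high enough."""
        self.checked = True
        # Assumption: ratio of record counts approximates data size reduction.
        reduction = self.seen / max(len(self.table), 1)
        if reduction < self.min_reduction:
            self.enabled = False

    def process(self, records):
        """Yield (key, value) pairs as a map task would emit them."""
        for key, value in records:
            if not self.enabled:
                # Aggregation is off: pass records through untouched.
                yield key, value
                continue
            self.seen += 1
            self.table[key] = self.table.get(key, 0) + value
            full = len(self.table) >= self.max_entries
            # The check runs after check_after records, or earlier if the
            # table fills up first.
            if not self.checked and (self.seen >= self.check_after or full):
                self._check()
                if not self.enabled:
                    yield from self._flush()
                    continue
            if full:
                # Still enabled but out of room: spill partial aggregates.
                yield from self._flush()
        # End of input: emit whatever is left in the table.
        yield from self._flush()
```

With these defaults, an input that has only a handful of distinct keys stays
aggregated, while an input of all-distinct keys trips the check after the first
1k records and is passed through from then on. Relaxing the thresholds, as
proposed above, would correspond to raising check_after or lowering
min_reduction in this sketch.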

