[jira] [Created] (PIG-5378) Optimize DISTINCT COUNT inside foreach

Rohini Palaniswamy (JIRA) Tue, 22 Jan 2019 11:04:42 -0800

Rohini Palaniswamy created PIG-5378:
---------------------------------------


             Summary: Optimize DISTINCT COUNT inside foreach
                 Key: PIG-5378
                 URL: https://issues.apache.org/jira/browse/PIG-5378
             Project: Pig
          Issue Type: Improvement
            Reporter: Rohini Palaniswamy


When a script does a DISTINCT followed by COUNT inside a foreach, the combiner 
is usually applied. In too many of our scripts, we have seen the DISTINCT bag 
grow to tens of thousands or millions of items, which makes hash aggregation 
perform much worse. Even if hash aggregation is turned off, the combiner will 
still aggregate, and in the reducer there is far too much spill because of the 
big bag.

This can be avoided if we apply secondary sort with ordering and make it use 
POSortedDistinct. Plain PODistinct is still not good enough, as it needs to 
hold all the elements in a HashSet. POSortedDistinct requires practically no 
memory, because its input arrives already sorted.
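
To illustrate the memory difference, here is a minimal standalone sketch. It 
is not Pig's actual PODistinct or POSortedDistinct code; the class and method 
names are hypothetical. The hash-based count keeps every distinct value until 
the end, while the sorted count assumes the input is already ordered and only 
remembers the previous value.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Objects;
import java.util.Set;

// Hypothetical illustration only; not part of the Pig code base.
public class DistinctCountSketch {

    // Hash-based distinct count: every distinct value stays in memory
    // until the whole input has been consumed.
    public static <T> long countDistinctHashed(Iterator<T> values) {
        Set<T> seen = new HashSet<>();
        while (values.hasNext()) {
            seen.add(values.next());
        }
        return seen.size();
    }

    // Sorted distinct count: assumes the input is already ordered, so equal
    // values are adjacent and only the previous value has to be retained.
    public static <T> long countDistinctSorted(Iterator<T> sortedValues) {
        long count = 0;
        boolean first = true;
        T previous = null;
        while (sortedValues.hasNext()) {
            T current = sortedValues.next();
            if (first || !Objects.equals(previous, current)) {
                count++;
            }
            previous = current;
            first = false;
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> sorted = Arrays.asList("a", "a", "b", "c", "c", "c");
        System.out.println(countDistinctHashed(sorted.iterator()));
        System.out.println(countDistinctSorted(sorted.iterator()));
    }
}
```

The sorted variant gives the right answer only when equal values are adjacent, 
which is what the secondary sort is meant to guarantee.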

Two things need to be done:
1) If we see a distinct count, turn it into a POSortedDistinct using 
SecondaryKeyOptimizer. Currently CombinerOptimizer runs first, so we need to 
turn off the combiner optimizer for distinct. This can be made configurable 
using pig.optimize.nested.distinct = true, which we can keep as the default in 
our clusters.
2) SecondaryKeyOptimizer does not convert it into POSortedDistinct in the case 
below, because of a POForEach in the plan before PODistinct 
(https://github.com/apache/pig/blob/branch-0.17/src/org/apache/pig/backend/hadoop/executionengine/util/SecondaryKeyOptimizerUtil.java#L529-L533).

{code}
B = GROUP A BY f1;
C = FOREACH B {
        sorted = ORDER A by f2;
        unique = DISTINCT sorted.f2;
        GENERATE group, COUNT(unique) as cnt;
}
{code}

This does not generate POSortedDistinct and has to be fixed. We worked around 
it by doing:

{code}
B = GROUP A BY f1;
C = FOREACH B {
        fields = A.f2;
        sorted = ORDER fields by f2;
        unique = DISTINCT sorted;
        GENERATE group, COUNT(unique) as cnt;
}
{code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
