Rohini Palaniswamy created PIG-5378:
---------------------------------------
Summary: Optimize DISTINCT COUNT inside foreach
Key: PIG-5378
URL: https://issues.apache.org/jira/browse/PIG-5378
Project: Pig
Issue Type: Improvement
Reporter: Rohini Palaniswamy
When there is DISTINCT COUNT, the combiner is usually applied. In too many of
our scripts, have seen that the DISTINCT bag grows to 10s of thousands or
millions of items making the hash aggregation really worse. Even if hash
aggregation is turned off, the combiner will still aggregate and in the reducer
there is way too much spill because of big bag.
This can be avoided if we apply secondary sort with ordering and make it use
POSortedDistinct. Just PODistinct is still not good enough as it will need to
hold all the elements in a HashSet. POSortedDistinct requires no memory at all.
Two things to be done:
1) If we see a distinct count, turn it into a POSortedDistinct using
SecondaryKeyOptimizer. Currently CombinerOptimizer runs first. We need to turn
off applying combiner optimizer for distinct. Can make this configurable using
pig.optimize.nested.distinct = true and keep it default in our clusters.
2) SecondaryKeyOptimizer is not converting it into POSortedDistinct in below
case because of a POForEach in plan before PODistinct
(https://github.com/apache/pig/blob/branch-0.17/src/org/apache/pig/backend/hadoop/executionengine/util/SecondaryKeyOptimizerUtil.java#L529-L533).
{code}
B = GROUP A BY f1;
C = FOREACH B {
sorted = ORDER A by f2;
unique = DISTINCT sorted.f2;
GENERATE group, COUNT(unique) as cnt;
}
{code}
does not generate POSortedDistinct and has to be fixed. Worked around by doing
{code}
B = GROUP A BY f1;
C = FOREACH B {
fields = A.f2;
sorted = ORDER A by f2;
unique = DISTINCT sorted;
GENERATE group, COUNT(unique) as cnt;
}
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)