During the first phase, the hash-agg operator is not protected from skew in the data (e.g., the data consists of 2 files where one file has far more records than the other). Assuming there are only 2 fragments, the hash-agg operator in one fragment handles many more records; it aggregates until the memory available to it is exhausted, at which point it sends its record batches downstream to the hash-partitioner.
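A toy model of that skew, with made-up numbers (this is just an illustration in Python, not Drill code): each fragment's phase-1 cost grows with the rows it has to read, so the phase is gated by whichever fragment got the big file.

# Toy model of the phase-1 skew described above (hypothetical numbers, not Drill code).
rows_per_fragment = [1_000, 9_000_000]   # fragment 0 got the small file, fragment 1 the big one

# Assume each fragment's phase-1 cost is roughly proportional to the rows it reads;
# the phase finishes only when the slowest fragment finishes.
print("phase-1 runtime ~ slowest fragment :", max(rows_per_fragment))
print("runtime if the input were balanced :", sum(rows_per_fragment) // len(rows_per_fragment))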
Because the hash-partitioner normalizes the skew in the data, the work is evenly divided and the 2 minor fragments running the second-phase hash-aggregate take a similar amount of processing time. So what is the problem here? During the first phase, one minor fragment takes a long time, which dominates the runtime of the query. If instead the first phase did no aggregation at all, or used only a small amount of memory (thereby limiting the aggregation performed), the query would have completed faster. The advantage of doing 2-phase aggregation is, however, reduced network traffic; but if the keys used in the group by are mostly unique, then we lose that advantage as well.

I was playing with the new spillable hash-agg code and observed that increasing memory did not improve the runtime, which the reasoning above explains. Aggregating on mostly unique keys may not be a common use case, but any thoughts in general about this?
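To put a rough number on the "mostly unique keys" case, here is a small Python sketch over synthetic data (the key ranges are made up, and this is not Drill code). It measures how much phase-1 aggregation shrinks what gets sent to the hash-partitioner: a low-cardinality group-by key shrinks it by roughly 1000x, while a mostly-unique key barely shrinks it at all, so phase 1 spends memory and CPU without saving any network traffic.

# Reduction from phase-1 aggregation = distinct keys seen / rows read (synthetic data).
import random
from collections import Counter

def phase1_reduction(keys):
    # Rows sent downstream to the hash-partitioner divided by rows read by the fragment.
    return len(Counter(keys)) / len(keys)

rows = 1_000_000
low_card = [random.randrange(1_000) for _ in range(rows)]     # heavy duplicates
high_card = [random.randrange(10**12) for _ in range(rows)]   # mostly unique

print("low-cardinality key :", phase1_reduction(low_card))    # ~0.001 -> big network savings
print("mostly-unique key   :", phase1_reduction(high_card))   # ~1.0   -> phase 1 buys almost nothing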
