During the first phase, the hash-agg operator is not protected from skew in the data (e.g., the data consists of 2 files where one file has far more records than the other). Assuming there are only 2 fragments, the hash-agg operator in one fragment handles many more records; it aggregates until the memory available to it is exhausted, at which point it sends its record batches downstream to the hash-partitioner.
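A toy model of that skew, with made-up numbers (this is just an illustration in Python, not Drill code): each fragment's phase-1 cost grows with the rows it has to read, so the phase is gated by whichever fragment got the big file.

# Toy model of the phase-1 skew described above (hypothetical numbers, not Drill code).
rows_per_fragment = [1_000, 9_000_000]   # fragment 0 got the small file, fragment 1 the big one

# Assume each fragment's phase-1 cost is roughly proportional to the rows it reads;
# the phase finishes only when the slowest fragment finishes.
print("phase-1 runtime ~ slowest fragment :", max(rows_per_fragment))
print("runtime if the input were balanced :", sum(rows_per_fragment) // len(rows_per_fragment))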
Because the hash-partitioner normalizes the skew in the data, the work is evenly divided and the 2 minor fragments running the second-phase hash-aggregate take a similar amount of processing time. So what is the problem here? During the first phase, one minor fragment takes a long time, which dominates the runtime of the query. If instead the first phase did no aggregation at all, or used only a small amount of memory (thereby limiting the aggregation performed), the query would have completed faster. The advantage of doing 2-phase aggregation is, however, reduced network traffic; but if the keys used in the group by are mostly unique, then we lose that advantage as well.

I was playing with the new spillable hash-agg code and observed that increasing memory did not improve the runtime, which the reasoning above explains. Aggregating on mostly unique keys may not be a common use case, but any thoughts in general about this?
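To put a rough number on the "mostly unique keys" case, here is a small Python sketch over synthetic data (the key ranges are made up, and this is not Drill code). It measures how much phase-1 aggregation shrinks what gets sent to the hash-partitioner: a low-cardinality group-by key shrinks it by roughly 1000x, while a mostly-unique key barely shrinks it at all, so phase 1 spends memory and CPU without saving any network traffic.

# Reduction from phase-1 aggregation = distinct keys seen / rows read (synthetic data).
import random
from collections import Counter

def phase1_reduction(keys):
    # Rows sent downstream to the hash-partitioner divided by rows read by the fragment.
    return len(Counter(keys)) / len(keys)

rows = 1_000_000
low_card = [random.randrange(1_000) for _ in range(rows)]     # heavy duplicates
high_card = [random.randrange(10**12) for _ in range(rows)]   # mostly unique

print("low-cardinality key :", phase1_reduction(low_card))    # ~0.001 -> big network savings
print("mostly-unique key   :", phase1_reduction(high_card))   # ~1.0   -> phase 1 buys almost nothing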
