Hi Takeshi,
Thanks for the answer. My UDAF aggregates data into an array of rows.
Apparently this makes it ineligible for hash-based aggregation, based on
the logic at:
Hi,
Spark always uses hash-based aggregates if the types of the aggregated data are
supported there;
otherwise, Spark cannot use hash-based aggregates and falls back to sort-based ones.
See:
Hi all,
It appears to me that Dataset.groupBy().agg(udaf) requires a full sort,
which is very inefficient for certain aggregations:
The code is very simple:
- I have a UDAF
- What I want to do is: dataset.groupBy(cols).agg(udaf).count()
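For readers following the thread, the difference between the two strategies can be sketched outside Spark. The snippet below is a minimal plain-Python illustration, not Spark's implementation: the sample rows and the function names hash_aggregate and sort_aggregate are assumptions made up for this example. Both collect values into a list per key, similar in spirit to a UDAF that builds an array of rows, but only the second one has to sort the whole input first.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter

# Hypothetical sample rows of (group key, value); not data from this thread.
rows = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]


def hash_aggregate(rows):
    """Single pass: keep one buffer per key in a hash map, no sorting."""
    buffers = defaultdict(list)
    for key, value in rows:
        buffers[key].append(value)
    return dict(buffers)


def sort_aggregate(rows):
    """Sort every row by key first, then fold each run of equal keys."""
    ordered = sorted(rows, key=itemgetter(0))  # the full sort over the input
    return {
        key: [value for _, value in group]
        for key, group in groupby(ordered, key=itemgetter(0))
    }


hash_result = hash_aggregate(rows)
sort_result = sort_aggregate(rows)

# Rough analogue of groupBy(cols).agg(udaf).count(): the number of groups.
group_count = len(hash_result)
```

Both functions return the same groups; the sort-based one pays for ordering every row before it can aggregate, which is the cost being asked about here.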
The physical plan I got was:
*HashAggregate(keys=[],