Eyal Allweil created DATAFU-173:
-----------------------------------
Summary: Change UDAFS to use Aggregator instead of
UserDefinedAggregateFunction
Key: DATAFU-173
URL: https://issues.apache.org/jira/browse/DATAFU-173
Project: DataFu
Issue Type: Improvement
Reporter: Eyal Allweil
Fix For: 2.0.0
Currently our UDAFs use the
[UserDefinedAggregateFunction|https://spark.apache.org/docs/2.4.5/api/java/org/apache/spark/sql/expressions/UserDefinedAggregateFunction.html]
class. There are two drawbacks with this:
1) It is less efficient than Aggregator
2) UserDefinedAggregateFunction is deprecated and removed from Spark 3.2.0.
This story is for changing them to use
[Aggregator|https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark/sql/expressions/Aggregator.html].
The UDAFs are located here:
[https://github.com/apache/datafu/blob/main/datafu-spark/src/main/scala/datafu/spark/SparkUDAFs.scala]
Here are some links explaining how to do this:
[https://stackoverflow.com/questions/48180598/spark-what-is-the-difference-between-aggregator-and-udaf]
[https://stackoverflow.com/questions/66808917/apache-spark-how-to-define-a-userdefinedaggregatefunction-after-3]
This change should be backwards compatible if possible; the tests in
[TestSparkUDAFs|https://github.com/apache/datafu/blob/main/datafu-spark/src/test/scala/datafu/spark/TestSparkUDAFs.scala]
should all still pass.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)