[ https://issues.apache.org/jira/browse/DATAFU-173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17799303#comment-17799303 ]
Eyal Allweil commented on DATAFU-173: ------------------------------------- It looks like complete backwards compatibility won't be possible, because before using a UDAF you needed code like this: {code:scala} val countDistinctUpTo3 = new CountDistinctUpTo(2) spark.table("input").agg(countDistinctUpTo3($"fldname") {code} and when using Aggregators you need to wrap the class with the _udaf_ method, like this: {code:scala} val countDistinctUpTo3 = udaf(new CountDistinctUpTo(2)) spark.table("input").agg(countDistinctUpTo3($"fldname") {code} So I propose making a new class, _Aggregators_, with a new test class _TestAggregators_, and we'll put it in our first Spark 3 release while deprecating the UDAFs, and then we can remove them for the Spark 3.2.x release. > Change UDAFS to use Aggregator instead of UserDefinedAggregateFunction > ---------------------------------------------------------------------- > > Key: DATAFU-173 > URL: https://issues.apache.org/jira/browse/DATAFU-173 > Project: DataFu > Issue Type: Improvement > Reporter: Eyal Allweil > Priority: Major > Fix For: 2.0.0 > > > Currently our UDAFs use the > [UserDefinedAggregateFunction|https://spark.apache.org/docs/2.4.5/api/java/org/apache/spark/sql/expressions/UserDefinedAggregateFunction.html] > class. There are two drawbacks with this: > 1) It is less efficient than Aggregator > 2) UserDefinedAggregateFunction is deprecated and removed from Spark 3.2.0. > > This story is for changing them to use > [Aggregator|https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark/sql/expressions/Aggregator.html]. > > The UDAFs are located here: > [https://github.com/apache/datafu/blob/main/datafu-spark/src/main/scala/datafu/spark/SparkUDAFs.scala] > > Here are some links explaining how to do this: > [https://stackoverflow.com/questions/48180598/spark-what-is-the-difference-between-aggregator-and-udaf] > [https://stackoverflow.com/questions/66808917/apache-spark-how-to-define-a-userdefinedaggregatefunction-after-3] > > This change should be backwards compatible if possible; the tests in > [TestSparkUDAFs|https://github.com/apache/datafu/blob/main/datafu-spark/src/test/scala/datafu/spark/TestSparkUDAFs.scala] > should all still pass. -- This message was sent by Atlassian Jira (v8.20.10#820010)