xupefei opened a new pull request, #46245: URL: https://github.com/apache/spark/pull/46245
### What changes were proposed in this pull request?

This PR changes Spark Connect to support defining and registering `Aggregator[IN, BUF, OUT]` UDAFs. The mechanism is similar to the existing support for Scalar UDFs: on the client side, we serialize the `Aggregator` instance and send it to the server, where the data is deserialized into an `Aggregator` instance recognized by Spark Core.

With this PR we now have two `Aggregator` interfaces defined, one in the Connect API and one in Core. They define exactly the same abstract methods and share the same `SerialVersionUID`, so the Java serialization engine can map one to the other. It is very important to keep these two definitions in sync at all times.

The second part of this effort will be adding the `Aggregator.toColumn` API (currently not implemented due to dependencies on Spark Core).

### Why are the changes needed?

Spark Connect does not have UDAF support. We need to fix that.

### Does this PR introduce _any_ user-facing change?

Yes. Connect users can now define an `Aggregator` and register it:

```scala
val agg = new Aggregator[Int, Int, Int] { ... }
spark.udf.register("agg", udaf(agg))

val ds: Dataset[Data] = ...
val aggregated = ds.selectExpr("agg(i)")
```

### How was this patch tested?

Added new tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at: us...@infra.apache.org
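For context, a complete `Aggregator` along the lines of the elided snippet above might look like the following. This is a hedged illustrative sketch, not code from this PR: the aggregator name, the `Long` sum logic, and the assumption of an in-scope `SparkSession` named `spark` are all hypothetical; the `Aggregator` method signatures and `udaf` helper are the standard Spark ones.

```scala
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.functions.udaf

// Hypothetical example: a sum aggregator over Long values.
val sumAgg = new Aggregator[Long, Long, Long] {
  // Initial value of the intermediate buffer.
  def zero: Long = 0L
  // Fold one input value into the buffer.
  def reduce(buffer: Long, value: Long): Long = buffer + value
  // Combine two partial buffers (e.g. from different partitions).
  def merge(b1: Long, b2: Long): Long = b1 + b2
  // Produce the final output from the buffer.
  def finish(reduction: Long): Long = reduction
  // Encoders for the buffer and output types.
  def bufferEncoder: Encoder[Long] = Encoders.scalaLong
  def outputEncoder: Encoder[Long] = Encoders.scalaLong
}

// Register the aggregator as a UDAF, assuming `spark` is a SparkSession;
// it can then be used from SQL or selectExpr, e.g. "sum_agg(i)".
spark.udf.register("sum_agg", udaf(sumAgg))
```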