xupefei opened a new pull request, #46245:
URL: https://github.com/apache/spark/pull/46245

   ### What changes were proposed in this pull request?
   
   This PR changes Spark Connect to support defining and registering `Aggregator[IN, BUF, OUT]` UDAFs.
   The mechanism is similar to the one used for Scalar UDFs: on the client side, we serialize the `Aggregator` instance and send it to the server, where it is deserialized into an `Aggregator` instance recognized by Spark Core.
   With this PR we now have two `Aggregator` interfaces: one in the Connect API and one in Core. They define exactly the same abstract methods and share the same `SerialVersionUID`, so the Java serialization engine can map one to the other. It is very important to keep these two definitions in sync.
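   As a reference, a minimal sketch of what the Connect-side mirror might look like is shown below. The package location and the `SerialVersionUID` value are placeholders, not the actual ones from this PR; only the shape of the class (abstract methods matching Spark Core's `org.apache.spark.sql.expressions.Aggregator`) is meant to be illustrative.
   ```scala
   import org.apache.spark.sql.Encoder

   // Sketch of the Connect-side mirror of Spark Core's Aggregator.
   // The abstract methods and the SerialVersionUID must match the Core class
   // exactly so that Java serialization can map instances across the two.
   @SerialVersionUID(0L) // placeholder: must equal the UID of the Core Aggregator
   abstract class Aggregator[-IN, BUF, OUT] extends Serializable {
     def zero: BUF                      // initial value of the aggregation buffer
     def reduce(b: BUF, a: IN): BUF     // fold one input row into the buffer
     def merge(b1: BUF, b2: BUF): BUF   // combine two partial buffers
     def finish(reduction: BUF): OUT    // produce the final result
     def bufferEncoder: Encoder[BUF]    // encoder for the intermediate buffer
     def outputEncoder: Encoder[OUT]    // encoder for the final output
   }
   ```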
   
   The second part of this effort will be adding the `Aggregator.toColumn` API (currently `NotImplemented` due to dependencies on Spark Core).
   
   ### Why are the changes needed?
   
   Spark Connect does not have UDAF support. We need to fix that.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, Connect users can now define an `Aggregator` and register it:
   ```scala
   import org.apache.spark.sql.expressions.Aggregator
   import org.apache.spark.sql.functions.udaf

   val agg = new Aggregator[Int, Int, Int] { ... }
   spark.udf.register("agg", udaf(agg))
   val ds: Dataset[Data] = ...
   val aggregated = ds.selectExpr("agg(i)")
   ```
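   For illustration, a complete (hypothetical) aggregator that sums the `i` column could look like this; `Data`, `sumAgg`, and the `sum_agg` name are examples, not part of the PR, and `spark` is assumed to be an existing `SparkSession`:
   ```scala
   import org.apache.spark.sql.{Dataset, Encoder, Encoders}
   import org.apache.spark.sql.expressions.Aggregator
   import org.apache.spark.sql.functions.udaf
   import spark.implicits._

   case class Data(i: Int) // assumed input schema

   // Sums Int inputs; buffer and output are both Int.
   val sumAgg = new Aggregator[Int, Int, Int] {
     def zero: Int = 0                            // empty buffer
     def reduce(b: Int, a: Int): Int = b + a      // add one row to the buffer
     def merge(b1: Int, b2: Int): Int = b1 + b2   // combine partial sums
     def finish(reduction: Int): Int = reduction  // final result
     def bufferEncoder: Encoder[Int] = Encoders.scalaInt
     def outputEncoder: Encoder[Int] = Encoders.scalaInt
   }

   spark.udf.register("sum_agg", udaf(sumAgg))
   val ds: Dataset[Data] = Seq(Data(1), Data(2), Data(3)).toDS()
   ds.selectExpr("sum_agg(i)").show() // expected single value: 6
   ```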
   
   ### How was this patch tested?
   
   Added new tests.
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Nope.
   

