Hi all,

I’d like to start a discussion on a draft SPIP: Language-agnostic UDF Protocol for Spark.
JIRA: https://issues.apache.org/jira/browse/SPARK-55278
Doc: https://docs.google.com/document/d/19Whzq127QxVt2Luk0EClgaDtcpBsFUp67NcVdKKyPF8/edit?tab=t.0

tl;dr: The SPIP proposes a structured, language-agnostic execution protocol for running user-defined functions (UDFs) in Spark across multiple programming languages.

Today, Spark Connect allows users to write queries from multiple languages, but support for user-defined functions remains incomplete. In practice, only Scala, Java, Python, and R have working UDF support, and that support relies on language-specific mechanisms that do not generalize well to other languages such as Go (Apache Spark Connect Go <https://github.com/apache/spark-connect-go>), Rust (Apache Spark Connect Rust <https://github.com/apache/spark-connect-rust>), Swift (Apache Spark Connect Swift <https://github.com/apache/spark-connect-swift>), or .NET (Spark Connect DotNet <https://github.com/GoEddie/spark-connect-dotnet>), where UDF support is currently unavailable. The existing PySpark worker.py implementation also carries legacy limitations that this proposal would address.

This proposal defines a unified API and execution protocol for UDFs that run outside the Spark executor process and communicate with Spark via inter-process communication (IPC). The goal is to enable Spark to interact with external workers in a consistent and extensible way, regardless of the implementation language.

I’m happy to help drive the discussion and development of this proposal, and I would greatly appreciate feedback from the community.
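To make the process boundary concrete, here is a minimal sketch of what an external UDF worker loop could look like, written in Go. Everything in it is an illustrative assumption rather than the protocol the SPIP specifies: the stdin/stdout transport, the length-prefixed framing, and the applyUDF placeholder are all hypothetical, and the design doc remains the source of truth.

// Hypothetical external UDF worker sketch. The framing (4-byte
// big-endian length prefix) and message layout are assumptions made
// for illustration, not the SPIP's actual wire format.
package main

import (
	"bufio"
	"encoding/binary"
	"io"
	"os"
)

// readFrame reads one length-prefixed message from the stream.
func readFrame(r io.Reader) ([]byte, error) {
	var n uint32
	if err := binary.Read(r, binary.BigEndian, &n); err != nil {
		return nil, err
	}
	buf := make([]byte, n)
	_, err := io.ReadFull(r, buf)
	return buf, err
}

// writeFrame writes one length-prefixed message to the stream.
func writeFrame(w io.Writer, payload []byte) error {
	if err := binary.Write(w, binary.BigEndian, uint32(len(payload))); err != nil {
		return err
	}
	_, err := w.Write(payload)
	return err
}

// applyUDF is a stand-in for decoding an input batch, running the
// user-defined function, and re-encoding the result; here it echoes.
func applyUDF(batch []byte) []byte { return batch }

func main() {
	in := bufio.NewReader(os.Stdin)
	out := bufio.NewWriter(os.Stdout)
	defer out.Flush()

	// Worker loop: Spark sends serialized input batches over IPC;
	// the worker applies the user function and returns result batches.
	for {
		batch, err := readFrame(in)
		if err == io.EOF {
			return
		}
		if err != nil {
			panic(err)
		}
		if err := writeFrame(out, applyUDF(batch)); err != nil {
			panic(err)
		}
		out.Flush()
	}
}

In a real implementation the batches would presumably carry a columnar encoding (e.g. Arrow, as the existing Python UDF path does); the sketch above is only meant to show the executor/worker boundary the proposal introduces.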
Thanks,
Haiyang Sun