[ https://issues.apache.org/jira/browse/SPARK-26257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16793842#comment-16793842 ]
Tyson Condie commented on SPARK-26257:
--------------------------------------

Thank you [~abellina] [~srowen] [~zjffdu] for your interest in this SPIP. Our Microsoft team has been working on the C# bindings for Spark, and while doing this work we have been thinking hard about what a generic interop layer might look like. First, much of our C# bindings reuse existing Python interop support. This includes the Catalyst rules that plan UDF physical operators (copied), establishing workers (Runners) for UDF execution, piping data to/from an external worker process, and the control channels in the driver that forward operations on C# proxy objects, all of which mirror existing Python and R interop mechanisms. [~imback82] (who is doing the bulk of the work on the C# bindings) may have other examples of overlapping logic.

As such, I think a single extensible interop layer makes sense for Spark. It would centralize some of the unit tests that are presently spread across the Python and R extensions and, more importantly, it would reduce the code footprint in Spark. Beyond that, it sends a good message to other communities (.NET, Go, Haskell, Julia) that Spark is a programming-language-friendly platform for data analytics.

As I mentioned earlier, we have been learning a great deal thus far, and we are now starting to collect our findings in a concrete design doc / proposal (which [~imback82] will be sharing soon) for the Spark community. Thank you again for your interest!
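To make the described overlap concrete, below is a minimal sketch of the kind of language-neutral worker protocol the comment refers to, i.e., piping length-prefixed data to and from an external worker process. This is not an existing Spark API and all names are hypothetical: the idea is that a generic layer would own the worker lifecycle and the data framing, while each language binding supplies only its worker command and a row (de)serializer.

{code:scala}
import java.io.{BufferedInputStream, BufferedOutputStream, DataInputStream, DataOutputStream}

// Binding-specific row encoding; e.g., a C# or Julia binding would plug in its own serializer.
trait RowSerializer {
  def toBytes(row: Seq[Any]): Array[Byte]
  def fromBytes(bytes: Array[Byte]): Seq[Any]
}

// Hypothetical generic counterpart to the per-language runners: launches the external
// worker, writes length-prefixed input rows, and reads length-prefixed result rows back.
class InteropWorkerRunner(workerCommand: Seq[String], serde: RowSerializer) {
  def evalPartition(input: Iterator[Seq[Any]]): Iterator[Seq[Any]] = {
    val proc = new ProcessBuilder(workerCommand: _*).start()
    val out  = new DataOutputStream(new BufferedOutputStream(proc.getOutputStream))
    val in   = new DataInputStream(new BufferedInputStream(proc.getInputStream))

    // A real implementation would write on a separate thread (as PythonRunner does)
    // to avoid deadlocking on large partitions; kept inline here for brevity.
    input.foreach { row =>
      val bytes = serde.toBytes(row)
      out.writeInt(bytes.length)
      out.write(bytes)
    }
    out.writeInt(-1) // end-of-data marker
    out.flush()

    // Stream results back until the worker signals end-of-data with a negative length.
    Iterator.continually(in.readInt()).takeWhile(_ >= 0).map { len =>
      val buf = new Array[Byte](len)
      in.readFully(buf)
      serde.fromBytes(buf)
    }
  }
}
{code}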
> SPIP: Interop Support for Spark Language Extensions
> ----------------------------------------------------
>
>                 Key: SPARK-26257
>                 URL: https://issues.apache.org/jira/browse/SPARK-26257
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark, R, Spark Core
>    Affects Versions: 2.4.0
>            Reporter: Tyson Condie
>            Priority: Minor
>
> h2. Background and Motivation:
> There is a desire for third-party language extensions for Apache Spark. Some notable examples include:
> * C#/F# from project Mobius [https://github.com/Microsoft/Mobius]
> * Haskell from project sparkle [https://github.com/tweag/sparkle]
> * Julia from project Spark.jl [https://github.com/dfdx/Spark.jl]
> Presently, Apache Spark supports Python and R via a tightly integrated interop layer. It would seem that much of that existing interop layer could be refactored into a clean surface for general (third-party) language bindings, such as those mentioned above. More specifically, could we generalize the following modules:
> * Deploy runners (e.g., PythonRunner and RRunner)
> * DataFrame Executors
> * RDD operations?
> The last is questionable: integrating third-party language extensions at the RDD level may be too heavyweight and unnecessary given the preference toward the DataFrame abstraction.
> The main goals of this effort would be:
> * Provide a clean abstraction for third-party language extensions, making it easier to maintain a language extension as Apache Spark evolves
> * Provide guidance to third-party language authors on how a language extension should be implemented
> * Provide general reusable libraries that are not specific to any language extension
> * Open the door to developers who prefer alternative languages
> * Identify and clean up common code shared between the Python and R interops
> h2. Target Personas:
> Data Scientists, Data Engineers, Library Developers
> h2. Goals:
> Data scientists and engineers will have the opportunity to work with Spark in languages other than what is natively supported. Library developers will be able to create language extensions for Spark in a clean way. The interop layer should also provide guidance for developing language extensions.
> h2. Non-Goals:
> The proposal does not aim to create an actual language extension. Rather, it aims to provide a stable interop layer for third-party language extensions to dock with.
> h2. Proposed API Changes:
> Much of the work will involve generalizing existing interop APIs for PySpark and R, specifically for the DataFrame API. For instance, it would be good to have a general deploy.Runner (similar to PythonRunner) for language extension efforts. In Spark SQL, it would be good to have a general InteropUDF and evaluator (similar to BatchEvalPythonExec).
> Low-level RDD operations should not be needed in this initial offering; depending on the success of the interop layer and with sufficient demand, RDD interop could be added later. However, one open question is whether to support a subset of low-level functions that are core to ETL, e.g., transform.
> h2. Optional Design Sketch:
> The work would be broken down into two top-level phases:
> Phase 1: Introduce a general interop API for deploying a driver/application and running an interop UDF, along with any other low-level transformations that aid with ETL.
> Phase 2: Port the existing Python and R language extensions to the new interop layer. This port should be contained solely to the Spark core side, and the protocols specific to Python and R should not change; e.g., Python should continue to use py4j as the protocol between the Python process and core Spark. The port itself should be contained to a handful of files, e.g., some examples for Python: PythonRunner, BatchEvalPythonExec, +PythonUDFRunner+, PythonRDD (possibly), and will mostly involve refactoring common logic into abstract implementations and utilities.
> h2. Optional Rejected Designs:
> The clear alternative is the status quo: developers who want to provide a third-party language extension to Spark do so directly, often by extending existing Python classes and overriding the portions that are relevant to the new extension. Not only is this unsound (e.g., a JuliaRDD is not a PythonRDD, even though PythonRDD contains a lot of reusable code), but it runs a great risk of future revisions making the subclass implementation obsolete. It would be hard to imagine any third-party language extension being successful without something in place to guarantee its long-term maintainability.
> Another alternative is for third-party languages to interact with Spark only via pure SQL, possibly over REST. However, this does not enable UDFs written in the third-party language, a key desideratum of this effort; the most notable case is legacy code/UDFs that would otherwise need to be ported to a supported language, e.g., Scala. That exercise is extremely cumbersome and not always feasible when the original code is no longer available, i.e., only the compiled library exists.
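As a rough illustration of the "general InteropUDF and evaluator" mentioned under Proposed API Changes above, the sketch below shows one possible shape for the binding-facing surface. All names and signatures are hypothetical, not a committed API: Spark core would stay language-agnostic and plan a single interop evaluation path, while each binding registers the evaluator that knows how to start and talk to its worker (analogous to what BatchEvalPythonExec does for Python today).

{code:scala}
// Hypothetical sketch of the surface a third-party binding would implement;
// names and signatures are illustrative only, not a proposed final API.
case class InteropUDF(
    name: String,
    func: Array[Byte],   // serialized function payload, in the binding's own format
    runtime: String)     // e.g. "dotnet", "julia" -- identifies the binding

trait InteropUDFEvaluator {
  // Start (or reuse) an external worker and evaluate the UDF over one partition.
  def evaluate(udf: InteropUDF, rows: Iterator[Seq[Any]]): Iterator[Seq[Any]]
}

// Spark core stays language-agnostic: each binding registers exactly one evaluator.
object InteropRegistry {
  private val evaluators = scala.collection.mutable.Map.empty[String, InteropUDFEvaluator]

  def register(runtime: String, evaluator: InteropUDFEvaluator): Unit =
    evaluators(runtime) = evaluator

  def evaluatorFor(udf: InteropUDF): InteropUDFEvaluator =
    evaluators.getOrElse(udf.runtime,
      throw new IllegalArgumentException(s"No interop evaluator registered for ${udf.runtime}"))
}
{code}

Under this hypothetical shape, the Phase 2 port would largely amount to re-expressing the existing PythonRunner/BatchEvalPythonExec logic as one such registered evaluator.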