[ https://issues.apache.org/jira/browse/SPARK-26257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16782491#comment-16782491 ]

Sean Owen commented on SPARK-26257:
-----------------------------------

I'm not sure how much of the PySpark and SparkR integration is meaningfully 
refactorable to support other languages; they both effectively wrap Spark. I'm 
therefore not sure it's necessary or a good thing to further modify Spark for 
these things, if they already manage to support Python and R. With the 
exception of C#, those are niche languages anyway. I'd be open to seeing 
concrete proposals.

> SPIP: Interop Support for Spark Language Extensions
> ---------------------------------------------------
>
>                 Key: SPARK-26257
>                 URL: https://issues.apache.org/jira/browse/SPARK-26257
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark, R, Spark Core
>    Affects Versions: 2.4.0
>            Reporter: Tyson Condie
>            Priority: Minor
>
> h2. Background and Motivation:
> There is a desire for third party language extensions for Apache Spark. Some 
> notable examples include:
>  * C#/F# from project Mobius [https://github.com/Microsoft/Mobius]
>  * Haskell from project sparkle [https://github.com/tweag/sparkle]
>  * Julia from project Spark.jl [https://github.com/dfdx/Spark.jl]
> Presently, Apache Spark supports Python and R via a tightly integrated 
> interop layer. It would seem that much of that existing interop layer could 
> be refactored into a clean surface for general (third-party) language 
> bindings, such as those mentioned above. More specifically, could we 
> generalize the following modules:
>  * Deploy runners (e.g., PythonRunner and RRunner)
>  * DataFrame Executors
>  * RDD operations?
> The last is questionable: integrating third-party language extensions at the 
> RDD level may be too heavyweight and unnecessary given the preference for the 
> DataFrame abstraction.
> The main goals of this effort would be:
>  * Provide a clean abstraction for third-party language extensions, making 
> the language extension easier to maintain as Apache Spark evolves
>  * Provide guidance to third-party language authors on how a language 
> extension should be implemented
>  * Provide general reusable libraries that are not specific to any language 
> extension
>  * Open the door to developers who prefer alternative languages
>  * Identify and clean up common code shared between the Python and R interops
> h2. Target Personas:
> Data Scientists, Data Engineers, Library Developers
> h2. Goals:
> Data scientists and engineers will have the opportunity to work with Spark in 
> languages other than those natively supported. Library developers will be 
> able to create language extensions for Spark in a clean way. The interop 
> layer should also provide guidance for developing language extensions.
> h2. Non-Goals:
> The proposal does not aim to create an actual language extension. Rather, it 
> aims to provide a stable interop layer for third party language extensions to 
> dock.
> h2. Proposed API Changes:
> Much of the work will involve generalizing existing interop APIs for PySpark 
> and R, specifically for the Dataframe API. For instance, it would be good to 
> have a general deploy.Runner (similar to PythonRunner) for language extension 
> efforts. In Spark SQL, it would be good to have a general InteropUDF and 
> evaluator (similar to BatchEvalPythonExec).
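> As a purely illustrative sketch (the names InteropUDF and InteropUDFEvaluator 
> are hypothetical, not existing Spark APIs), such a surface might look roughly 
> like the following, with Spark treating the serialized function and its rows 
> as opaque bytes and the language binding supplying the evaluator:
> {code:scala}
> // Hypothetical sketch only: a language-agnostic UDF descriptor and the
> // contract a third-party binding would implement to evaluate it.
> 
> /** Describes a UDF implemented in an external language runtime. */
> case class InteropUDF(
>     name: String,          // logical function name
>     command: Array[Byte],  // serialized function payload, opaque to Spark
>     runtime: String)       // e.g. "python", "dotnet", "julia"
> 
> /** Contract a language binding implements so Spark can evaluate its UDFs. */
> trait InteropUDFEvaluator {
>   /** Start (or connect to) the external worker process. */
>   def start(): Unit
> 
>   /** Evaluate one batch: serialized input rows in, serialized results out. */
>   def evalBatch(udf: InteropUDF,
>                 inputRows: Iterator[Array[Byte]]): Iterator[Array[Byte]]
> 
>   /** Shut down the external worker. */
>   def stop(): Unit
> }
> {code}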
> Low-level RDD operations should not be needed in this initial offering; 
> depending on the success of the interop layer and with proper demand, RDD 
> interop could be added later. However, one open question is supporting a 
> subset of low-level functions that are core to ETL, e.g., transform.
> h2. Optional Design Sketch:
> The work would be broken down into two top-level phases:
>  Phase 1: Introduce a general interop API for deploying a driver/application 
> and running an interop UDF, along with any other low-level transformations 
> that aid with ETL.
>  Phase 2: Port the existing Python and R language extensions to the new 
> interop layer. This port should be contained solely to the Spark core side, 
> and the protocols specific to Python and R should not change; e.g., Python 
> should continue to use py4j as the protocol between the Python process and 
> core Spark. The port itself should be contained to a handful of files (some 
> examples for Python: PythonRunner, BatchEvalPythonExec, PythonUDFRunner, and 
> possibly PythonRDD) and will mostly involve refactoring common logic into 
> abstract implementations and utilities.
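> As a minimal, purely hypothetical sketch of that refactoring (names such as 
> LanguageBackend and BaseInteropRunner are illustrative, not existing Spark 
> classes), the launch flow that PythonRunner and RRunner share could be lifted 
> into a common base, with each language binding supplying only its launch 
> command and environment:
> {code:scala}
> // Hypothetical sketch only: common driver-launch flow factored out of
> // PythonRunner/RRunner-style code; each binding supplies the details.
> 
> trait LanguageBackend {
>   /** Command line that launches the external runtime, e.g. Seq("python3", file). */
>   def launchCommand(userFile: String, args: Seq[String]): Seq[String]
> 
>   /** Environment variables the runtime needs (bridge port, auth secret, ...). */
>   def environment(bridgePort: Int): Map[String, String]
> }
> 
> abstract class BaseInteropRunner(backend: LanguageBackend) {
>   /** Start the JVM-side bridge (py4j for Python, the R backend for R, ...). */
>   protected def startBridge(): Int
> 
>   /** Launch the user's application in the external runtime and wait for it. */
>   def run(userFile: String, args: Seq[String]): Int = {
>     val port = startBridge()
>     val pb = new ProcessBuilder(backend.launchCommand(userFile, args): _*)
>     backend.environment(port).foreach { case (k, v) => pb.environment().put(k, v) }
>     pb.inheritIO()
>     pb.start().waitFor()
>   }
> }
> {code}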
> h2. Optional Rejected Designs:
> The clear alternative is the status quo: developers who want to provide a 
> third-party language extension to Spark do so directly, often by extending 
> existing Python classes and overriding the portions that are relevant to the 
> new extension. Not only is this unsound code (e.g., a JuliaRDD is not a 
> PythonRDD, even though PythonRDD contains a lot of reusable code), but it 
> also runs the risk of future revisions making the subclass implementation 
> obsolete. It would be hard to imagine any third-party language extension 
> being successful without something in place to guarantee its long-term 
> maintainability.
> Another alternative is that third-party languages should interact with Spark 
> only via pure SQL, possibly over REST. However, this does not enable UDFs 
> written in the third-party language, which is a key desideratum of this 
> effort; it arises most notably with legacy code/UDFs that would otherwise 
> need to be ported to a supported language, e.g., Scala. That exercise is 
> extremely cumbersome and not always feasible, since the source code may no 
> longer be available, i.e., only the compiled library exists.


