Tyson Condie created SPARK-26257:
------------------------------------

             Summary: SPIP: Interop Support for Spark Language Extensions
                 Key: SPARK-26257
                 URL: https://issues.apache.org/jira/browse/SPARK-26257
             Project: Spark
          Issue Type: Improvement
          Components: PySpark, R, Spark Core
    Affects Versions: 2.4.0
            Reporter: Tyson Condie


h2. Background and Motivation:

There is a desire for third-party language extensions for Apache Spark. Some 
notable examples include:
 * C#/F# from project Mobius [https://github.com/Microsoft/Mobius]
 * Haskell from project sparkle [https://github.com/tweag/sparkle]
 * Julia from project Spark.jl [https://github.com/dfdx/Spark.jl]

Presently, Apache Spark supports Python and R via a tightly integrated interop 
layer. It would seem that much of that existing interop layer could be 
refactored into a clean surface for general (third-party) language bindings, 
such as those mentioned above. More specifically, could we generalize the 
following modules:
 * Deploy runners (e.g., PythonRunner and RRunner)
 * DataFrame Executors
 * RDD operations?

The last is questionable: integrating third-party language extensions at the 
RDD level may be too heavyweight and unnecessary, given the preference for the 
DataFrame abstraction.

The main goals of this effort would be:
 * Provide a clean abstraction for third-party language extensions, making it easier to maintain a language extension as Apache Spark evolves
 * Provide guidance to third-party language authors on how a language extension should be implemented
 * Provide general, reusable libraries that are not specific to any one language extension
 * Open the door to developers who prefer alternative languages
 * Identify and clean up common code shared between the Python and R interop layers

h2. Target Personas:

Data Scientists, Data Engineers, Library Developers
h2. Goals:

Data scientists and engineers will have the opportunity to work with Spark in 
languages other than those natively supported. Library developers will be able 
to create language extensions for Spark in a clean way. The interop layer 
should also provide guidance for developing language extensions.
h2. Non-Goals:

The proposal does not aim to create an actual language extension. Rather, it 
aims to provide a stable interop layer for third-party language extensions to 
dock onto.
h2. Proposed API Changes:

Much of the work will involve generalizing existing interop APIs for PySpark 
and R, specifically for the DataFrame API. For instance, it would be good to 
have a general deploy.Runner (similar to PythonRunner) for language extension 
efforts. In Spark SQL, it would be good to have a general InteropUDF and 
evaluator (similar to BatchEvalPythonExec).
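
To make the intent concrete, here is a rough, hypothetical Scala sketch of what such a generalized surface could look like. The package, trait, and method names (InteropRunner, InteropUDF, writeBatch, readBatch) are illustrative assumptions, not a concrete API proposal.

{code:scala}
// Hypothetical sketch only: names and signatures are illustrative, not an actual proposed API.
package org.apache.spark.api.interop

import java.net.Socket

import org.apache.spark.sql.catalyst.InternalRow

// Generalization of deploy.PythonRunner / RRunner: launch the guest-language
// driver process and wire it up to the JVM.
trait InteropRunner {
  /** Name of the language binding, e.g. "dotnet", "julia". */
  def language: String

  /** Start the guest-language driver with the user's program and arguments. */
  def runDriver(primaryResource: String, args: Seq[String]): Unit
}

// Generalization of a Python UDF: an opaque, serialized function in a guest
// language, evaluated out-of-process by a batch evaluator
// (analogous to BatchEvalPythonExec).
trait InteropUDF {
  def name: String

  /** Serialized closure / payload understood by the guest-language worker. */
  def command: Array[Byte]

  /** Serialize a batch of input rows to the worker over the given socket. */
  def writeBatch(rows: Iterator[InternalRow], worker: Socket): Unit

  /** Deserialize the worker's results back into rows. */
  def readBatch(worker: Socket): Iterator[InternalRow]
}
{code}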

Low-level RDD operations should not be needed in this initial offering; 
depending on the success of the interop layer, and with sufficient demand, RDD 
interop could be added later. However, one open question is whether to support 
a subset of low-level functions that are core to ETL, e.g., transform.
h2. Optional Design Sketch:

The work would be broken down into two top-level phases:
 Phase 1: Introduce a general interop API for deploying a driver/application 
and running an interop UDF, along with any other low-level transformations that 
aid ETL.

Phase 2: Port the existing Python and R language extensions to the new interop 
layer. This port should be contained solely to the Spark core side, and 
protocols specific to Python and R should not change; e.g., Python should 
continue to use Py4J as the protocol between the Python process and core Spark. 
The port itself should be contained to a handful of files, e.g., for Python: 
PythonRunner, BatchEvalPythonExec, PythonUDFRunner, and possibly PythonRDD, and 
will mostly involve refactoring common logic into abstract implementations and 
utilities.
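
As an illustration of the Phase 2 port, and assuming the hypothetical InteropRunner trait sketched earlier, a Python binding could become a thin adapter over the shared abstraction while keeping Py4J as its wire protocol. The object name and launch details below are assumptions for illustration, not the actual PythonRunner implementation.

{code:scala}
// Hypothetical sketch: a Python binding expressed against the InteropRunner
// trait from the earlier sketch. Py4J stays Python-specific; only generic
// process-launch plumbing would move into shared utilities.
object PythonInteropRunner extends InteropRunner {
  override val language: String = "python"

  override def runDriver(primaryResource: String, args: Seq[String]): Unit = {
    // Python-specific step (unchanged by the port): start the Py4J gateway so
    // the Python driver can call back into the JVM, and pass its port to the
    // Python process via the environment.

    // Candidate for shared logic: launch the guest-language process with the
    // user's program and arguments, mirror its stdout/stderr, and propagate
    // its exit code.
    val builder = new ProcessBuilder((Seq("python3", primaryResource) ++ args): _*)
    builder.inheritIO()
    val exitCode = builder.start().waitFor()
    if (exitCode != 0) {
      sys.exit(exitCode)
    }
  }
}
{code}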
h2. Optional Rejected Designs:

The clear alternative is the status quo: developers who want to provide a 
third-party language extension to Spark do so directly, often by extending 
existing Python classes and overriding the portions that are relevant to the 
new extension. Not only is this unsound design (e.g., a JuliaRDD is not a 
PythonRDD, which contains a lot of reusable code), but it runs a significant 
risk that future revisions will make the subclass implementation obsolete. It 
is hard to imagine any third-party language extension succeeding if there were 
nothing in place to guarantee its long-term maintainability.

Another alternative is for third-party languages to interact with Spark only 
via pure SQL, possibly over REST. However, this does not enable UDFs written in 
the third-party language, which is a key desideratum of this effort; most 
notably, legacy code/UDFs would have to be ported to a supported language, 
e.g., Scala. That exercise is extremely cumbersome and not always feasible, 
because the source code may no longer be available, i.e., only the compiled 
library exists.


