There seems to be some desire for third-party language extensions for Apache
Spark. Some notable examples include:

*       C#/F# from project Mobius https://github.com/Microsoft/Mobius
*       Haskell from project sparkle https://github.com/tweag/sparkle
*       Julia from project Spark.jl https://github.com/dfdx/Spark.jl

Presently, Apache Spark supports Python and R via a tightly integrated
interop layer. It seems that much of that existing interop layer could be
refactored into a clean surface for general (third-party) language bindings,
such as those mentioned above. More specifically, could we generalize the
following modules:

1.      Deploy runners (e.g., PythonRunner and RRunner) 
2.      DataFrame Executors
3.      RDD operations?

The last is questionable: integrating third-party language extensions at the
RDD level may be too heavyweight, and perhaps unnecessary given the
preference for the DataFrame abstraction.

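To make the first point concrete, here is a minimal, hypothetical sketch of
what a generalized runner surface might look like, with PythonRunner-style
and third-party bindings implementing the same small interface. All names
here (LanguageRunner, launchCommand, the *Sketch classes) are illustrative
assumptions, not part of Spark's actual API.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical abstraction that deploy runners (e.g., PythonRunner,
// RRunner) and third-party bindings could share. Illustrative only.
interface LanguageRunner {
    /** Command line used to launch the guest-language process. */
    List<String> launchCommand(String userScript);
}

// A Python binding would supply its interpreter invocation...
class PythonRunnerSketch implements LanguageRunner {
    public List<String> launchCommand(String userScript) {
        return Arrays.asList("python3", userScript);
    }
}

// ...and a third-party binding (say, Julia) would only need to
// implement the same small surface, rather than re-deriving Spark's
// internal interop plumbing.
class JuliaRunnerSketch implements LanguageRunner {
    public List<String> launchCommand(String userScript) {
        return Arrays.asList("julia", userScript);
    }
}

public class RunnerDemo {
    public static void main(String[] args) {
        LanguageRunner runner = new JuliaRunnerSketch();
        System.out.println(String.join(" ", runner.launchCommand("job.jl")));
    }
}
```

The point of the sketch is only that the surface a binding must implement
can be small and stable, so each extension tracks an interface rather than
Spark internals.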
The main goals of this effort would be:

1.      Provide a clean abstraction for third-party language extensions,
making each extension easier to maintain as Apache Spark evolves
2.      Provide guidance to third-party language authors on how a language
extension should be implemented
3.      Provide general reusable libraries that are not specific to any
language extension
4.      Open the door to developers who prefer alternative languages

-Tyson Condie 
