[jira] [Commented] (SPARK-26257) SPIP: Interop Support for Spark Language Extensions

2019-03-15 Thread Tyson Condie (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-26257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16793842#comment-16793842 ]

Tyson Condie commented on SPARK-26257:

Thank you [~abellina] [~srowen] [~zjffdu] for your interest in this SPIP. Our 
Microsoft team has been working on the C# bindings for Spark, and while doing 
that work we have been thinking hard about what a generic interop layer might 
look like. Much of our C# bindings reuse existing Python interop support: the 
Catalyst rules that plan UDF physical operators (copied), establishing workers 
(Runners) for UDF execution, piping data to/from an external worker process, 
and control channels in the driver that forward operations on C# proxy 
objects, all of which mirror existing Python and R interop mechanisms. 
[~imback82] (who is doing the bulk of the work on the C# bindings) may have 
other examples of overlapping logic. 
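
To make the last point concrete, here is a minimal sketch of the kind of 
driver-side control channel we mean: method calls made on language-side proxy 
objects are serialized and forwarded to an external interop process over a 
socket. The names (InteropControlChannel, CallRequest) and the framing are 
illustrative assumptions for this sketch, not existing Spark or Mobius classes.

{code:scala}
// Hypothetical sketch (not Spark's or Mobius's actual classes): a driver-side
// control channel that forwards method calls made on language-side proxy
// objects to an external interop process over a socket, using simple
// length-prefixed framing.
import java.io.{DataInputStream, DataOutputStream}
import java.net.Socket
import java.nio.charset.StandardCharsets

final case class CallRequest(objectId: String, method: String, args: Seq[String])

final class InteropControlChannel(host: String, port: Int) extends AutoCloseable {
  private val socket = new Socket(host, port)
  private val out = new DataOutputStream(socket.getOutputStream)
  private val in = new DataInputStream(socket.getInputStream)

  /** Serialize a proxy-object call, send it to the worker, return the reply. */
  def invoke(req: CallRequest): String = {
    val payload = (req.objectId +: req.method +: req.args).mkString("\t")
    val bytes = payload.getBytes(StandardCharsets.UTF_8)
    out.writeInt(bytes.length)
    out.write(bytes)
    out.flush()
    val reply = new Array[Byte](in.readInt())
    in.readFully(reply)
    new String(reply, StandardCharsets.UTF_8)
  }

  override def close(): Unit = socket.close()
}
{code}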

As such, I think a single extensible interop layer makes sense for Spark. It 
would centralize some of the unit tests that are presently spread across the 
Python and R extensions and, more importantly, it would reduce the code 
footprint in Spark. Beyond that, it sends a good message to other communities 
(.NET, Go, Haskell, Julia) that Spark is a programming-language-friendly 
platform for data analytics. 

As I mentioned earlier, we have been learning as we go, and we are now starting 
to collect our findings in a concrete design doc / proposal that [~imback82] 
will be sharing with the Spark community soon.

Thank you again for your interest! 

> SPIP: Interop Support for Spark Language Extensions
> ---
>
> Key: SPARK-26257
> URL: https://issues.apache.org/jira/browse/SPARK-26257
> Project: Spark
> Issue Type: Improvement
> Components: PySpark, R, Spark Core
> Affects Versions: 2.4.0
> Reporter: Tyson Condie
> Priority: Minor
>
> h2. Background and Motivation:
> There is a desire for third-party language extensions for Apache Spark. Some 
> notable examples include:
>  * C#/F# from project Mobius [https://github.com/Microsoft/Mobius]
>  * Haskell from project sparkle [https://github.com/tweag/sparkle]
>  * Julia from project Spark.jl [https://github.com/dfdx/Spark.jl]
> Presently, Apache Spark supports Python and R via a tightly integrated 
> interop layer. It would seem that much of that existing interop layer could 
> be refactored into a clean surface for general (third-party) language 
> bindings, such as those mentioned above. More specifically, could we 
> generalize the following modules:
>  * Deploy runners (e.g., PythonRunner and RRunner)
>  * DataFrame Executors
>  * RDD operations?
> The last is questionable: integrating third-party language extensions at the 
> RDD level may be too heavyweight and unnecessary given the preference for 
> the DataFrame abstraction.
> The main goals of this effort would be:
>  * Provide a clean abstraction for third-party language extensions, making 
> it easier to maintain the language extension as Apache Spark evolves
>  * Provide guidance to third-party language authors on how a language 
> extension should be implemented
>  * Provide general reusable libraries that are not specific to any language 
> extension
>  * Open the door to developers who prefer alternative languages
>  * Identify and clean up common code shared between the Python and R interops
> h2. Target Personas:
> Data Scientists, Data Engineers, Library Developers
> h2. Goals:
> Data scientists and engineers will have the opportunity to work with Spark 
> in languages other than those natively supported. Library developers will be 
> able to create language extensions for Spark in a clean way. The interop 
> layer should also provide guidance for developing language extensions.
> h2. Non-Goals:
> The proposal does not aim to create an actual language extension. Rather, it 
> aims to provide a stable interop layer for third-party language extensions 
> to dock.
> h2. Proposed API Changes:
> Much of the work will involve generalizing the existing interop APIs for 
> PySpark and R, specifically for the DataFrame API. For instance, it would be 
> good to have a general deploy.Runner (similar to PythonRunner) for language 
> extension efforts. In Spark SQL, it would be good to have a general 
> InteropUDF and evaluator (similar to BatchEvalPythonExec).
> Low-level RDD operations should not be needed in this initial offering; 
> depending on the success of the interop layer, and with proper demand, RDD 
> interop could be added later. However, one open question is supporting a 
> subset of low-level functions that are core to ETL, e.g., transform.
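
A minimal sketch, for illustration only, of what a language-neutral UDF 
description and evaluator surface along these lines could look like. 
InteropUDF, InteropEvaluator, and the byte-oriented row encoding are 
assumptions made for this sketch, not existing Spark classes.

{code:scala}
// Hypothetical, simplified surface for evaluating UDFs implemented in an
// external language runtime (illustrative only; not Spark's actual API).

/** Description of a UDF implemented in an external language runtime. */
final case class InteropUDF(
    name: String,
    language: String,      // e.g. "dotnet", "julia"
    command: Array[Byte],  // serialized function payload understood by the worker
    outputType: String)    // return type in DDL form, e.g. "bigint"

/** Evaluates batches of serialized input rows against an external worker. */
trait InteropEvaluator {
  /** Launch (or reuse) a worker process for the given UDF. */
  def open(udf: InteropUDF): Unit

  /** Send one batch of encoded rows and return the encoded results. */
  def evaluate(batch: Iterator[Array[Byte]]): Iterator[Array[Byte]]

  /** Release the worker and any sockets/processes it holds. */
  def close(): Unit
}
{code}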
> h2. Optional Design Sketch:
> The work would be broken down into two top-level phases:
> Phase 1: Introduce a general interop API for deploying a driver/application 
> and running an interop UDF, along with any other low-level transformations 
> that aid with ETL.
> Phase 2: Port the existing Python and R language extensions to the new 
> interop layer. This port should be contained solely to the Spark core side, 
> and protocols specific to Python and R should not change; e.g., Python 
> should continue to use Py4J as the protocol between the Python process and 
> core Spark. The port itself should be contained to a handful of files (some 
> examples for Python: PythonRunner, BatchEvalPythonExec, PythonUDFRunner, and 
> possibly PythonRDD) and will mostly involve refactoring common logic into 
> abstract implementations and utilities.
> h2. Optional Rejected Designs:
> The clear alternative is the status quo: developers that want to provide a 
> third-party language extension to Spark do so directly, often by extending 
> existing Python classes and overriding the portions that are relevant to the 
> new extension. Not only is this not sound code (e.g., a JuliaRDD is not a 
> PythonRDD, which contains a lot of reusable code), but it runs the great 
> risk of future revisions making the subclass implementation obsolete. It 
> would be hard to imagine that any third-party

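For the Phase 1 deployment piece, a hedged sketch, in the spirit of 
PythonRunner/RRunner, of a generalized runner that launches the external 
language driver process. The argument layout and the environment variable 
name are assumptions for illustration, not an existing Spark contract.

{code:scala}
// Hypothetical generalized deploy runner (illustrative only): launch the
// external language driver and hand it the port of the JVM-side backend that
// serves proxy-object calls (the role py4j plays for Python today).
import scala.sys.process._

object InteropRunner {
  def main(args: Array[String]): Unit = {
    require(args.nonEmpty, "usage: InteropRunner <driver executable> [app args...]")
    val driverExecutable = args.head
    val appArgs = args.tail.toSeq

    // In a real runner this port would come from the backend server started
    // on the JVM side; here it is simply passed through from the environment.
    val backendPort = sys.env.getOrElse("SPARK_INTEROP_BACKEND_PORT", "0")

    // Launch the external driver process with the backend port in its
    // environment and propagate its exit code.
    val exitCode = Process(
      driverExecutable +: appArgs,
      None,
      "SPARK_INTEROP_BACKEND_PORT" -> backendPort
    ).!

    sys.exit(exitCode)
  }
}
{code}
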
[jira] [Commented] (SPARK-26257) SPIP: Interop Support for Spark Language Extensions

2019-03-03 Thread Jeff Zhang (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-26257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782633#comment-16782633 ]

Jeff Zhang commented on SPARK-26257:


I know Apache Beam provides an abstraction layer for multiple-language 
support; I'm not sure whether the proposal here is related to that. Looking 
forward to the design doc. [~tcondie]


[jira] [Commented] (SPARK-26257) SPIP: Interop Support for Spark Language Extensions

2019-03-02 Thread Sean Owen (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-26257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782491#comment-16782491 ]

Sean Owen commented on SPARK-26257:

I'm not sure how much of the PySpark and SparkR integration is meaningfully 
refactorable to support other languages. They both wrap Spark, effectively. 
I'm therefore not sure it's necessary, or a good thing, to further modify 
Spark for these extensions when the existing approach already manages to 
support Python and R. With the exception of C#, those are niche languages 
anyway. I'd be open to seeing concrete proposals.


[jira] [Commented] (SPARK-26257) SPIP: Interop Support for Spark Language Extensions

2019-01-14 Thread Alessandro Bellina (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-26257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16742519#comment-16742519 ]

Alessandro Bellina commented on SPARK-26257:


[~tcondie] I think you should share the design doc here too. It's a good idea; 
I'm just curious about the details of what's being proposed.


[jira] [Commented] (SPARK-26257) SPIP: Interop Support for Spark Language Extensions

2019-01-03 Thread Tyson Condie (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-26257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16733297#comment-16733297 ]

Tyson Condie commented on SPARK-26257:

[~rxin] Any thoughts regarding this SPIP? I have a draft of a detailed design 
document describing what code changes could be made to facilitate a 
general-purpose language interop API. Should that be shared here?
