[ 
https://issues.apache.org/jira/browse/SPARK-27006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16779595#comment-16779595
 ] 

Sean Owen commented on SPARK-27006:
-----------------------------------

It'd be much, much better if this can, at least, start outside Spark. I don't 
think we can keep a whole other language and copy of the APIs and build up to 
date in the Spark project at this point.

> SPIP: .NET bindings for Apache Spark
> ------------------------------------
>
>                 Key: SPARK-27006
>                 URL: https://issues.apache.org/jira/browse/SPARK-27006
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.0.0
>            Reporter: Terry Kim
>            Priority: Minor
>             Fix For: 3.0.0
>
>   Original Estimate: 4,032h
>  Remaining Estimate: 4,032h
>
> h4. Background and Motivation: 
> Apache Spark provides programming language support for Scala/Java (native), 
> and extensions for Python and R. While a variety of other language extensions 
> are possible to include in Apache Spark, .NET would bring one of the largest 
> developer communities to the table. Presently, no good Big Data solution exists 
> for .NET developers in open source.  This SPIP aims at discussing how we can 
> bring Apache Spark goodness to the .NET development platform.  
> .NET is a free, cross-platform, open source developer platform for building 
> many different types of applications. With .NET, you can use multiple 
> languages, editors, and libraries to build for web, mobile, desktop, gaming, 
> and IoT types of applications. Even with .NET serving millions of developers, 
> there is no good Big Data solution that exists today, which this SPIP aims to 
> address.  
> The .NET developer community is one of the largest programming language 
> communities in the world. Its flagship programming language C# is listed as 
> one of the most popular programming languages in a variety of articles and 
> statistics: 
>  * Most popular Technologies on Stack Overflow: 
> [https://insights.stackoverflow.com/survey/2018/#most-popular-technologies|https://insights.stackoverflow.com/survey/2018/]
>   
>  * Most popular languages on GitHub 2018: 
> [https://www.businessinsider.com/the-10-most-popular-programming-languages-according-to-github-2018-10#2-java-9|https://www.businessinsider.com/the-10-most-popular-programming-languages-according-to-github-2018-10]
>  
>  * 1M+ new developers in the last year 
>  * Second most demanded technology on LinkedIn 
>  * Top 30 High velocity OSS projects on GitHub 
> Including a C# language extension in Apache Spark will enable millions of 
> .NET developers to author Big Data applications in their preferred 
> programming language, developer environment, and tooling support. We aim to 
> promote the .NET bindings for Spark through engagements with the Spark 
> community (e.g., we are scheduled to present an early prototype at the SF 
> Spark Summit 2019) and the .NET developer community (e.g., similar 
> presentations will be held at .NET developer conferences this year).  As 
> such, we believe that our efforts will help grow the Spark community by 
> making it accessible to the millions of .NET developers. 
> Furthermore, our early discussions with some large .NET development teams got 
> an enthusiastic reception. 
> We recognize that earlier attempts at this goal (specifically Mobius 
> [https://github.com/Microsoft/Mobius]) were unsuccessful primarily due to the 
> lack of communication with the Spark community. Therefore, another goal of 
> this proposal is to not only develop .NET bindings for Spark in open source, 
> but also continuously seek feedback from the Spark community via posted 
> Jira issues (like this one) and the Spark developer mailing list. Our hope is that 
> through these engagements, we can build a community of developers that are 
> eager to contribute to this effort or want to leverage the resulting .NET 
> bindings for Spark in their respective Big Data applications. 
> h4. Target Personas: 
> .NET developers looking to build big data solutions.  
> h4. Goals: 
> Our primary goal is to help grow Apache Spark by making it accessible to the 
> large .NET developer base and ecosystem. We will also look for opportunities 
> to generalize the interop layers for Spark for adding other language 
> extensions in the future. 
> [SPARK-26257|https://issues.apache.org/jira/browse/SPARK-26257] proposes such a 
> generalized interop layer, which we hope to address over the course of this 
> project.  
> Another important goal for us is not only to enable Spark as an application 
> solution for .NET developers, but also to open the door for .NET developers 
> to make contributions to Apache Spark itself. 
> Lastly, we aim to develop a .NET extension in the open, while continually 
> engaging with the Spark community for feedback on designs and code. We will 
> welcome PRs from the Spark community throughout this project and aim to grow 
> a community of developers that want to contribute to this project.  
> h4. Non-Goals: 
> This proposal focuses on adding .NET bindings to Apache Spark and leaves 
> any performance-related tasks for future work. Further, we aim to provide 
> support only at the DataFrame level. 
> h4. Proposed API Changes: 
> This work mostly involves introducing new .NET binding APIs. For example, we 
> would introduce .NET UDF related classes such as DotnetUDF, 
> UserDefinedDotnetFunction, etc., along with classes responsible for running 
> .NET UDFs such as DotnetRunner, DotnetWorkerFactory, etc. 
> This work should have minimal impact on existing Spark APIs. However, in 
> order to provide a clean solution, we foresee the possibility of introducing 
> .NET specific hooks in the Dataset API for collecting data in the driver 
> program, for example. 
> We will also introduce Catalyst rules that plan the physical operator 
> (which we will introduce) for the DotnetUDF expression in the logical 
> plan. 
> On the C# side, similar to existing language extensions, we will introduce 
> proxy artifacts that mimic SparkSession, DataFrame, and other APIs 
> related to Spark SQL (e.g., Column and the functions native to Spark SQL). 
> We will also look into augmenting the existing spark-submit and spark-shell 
> scripts with the ability to recognize a .NET environment.  
> h4. Optional Design Sketch: 
> Our design will largely follow the design of Python Spark support, including 
> how worker orchestration is performed (i.e., two-process solution, IPC 
> communication). As such, we will introduce “Runners” specific to executing 
> Dotnet driver and UDF workers.  
> h4. Optional Rejected Designs: 
> The clear alternative is the status quo; developers that want to leverage 
> Apache Spark do so through one of the existing supported languages i.e., 
> Scala/Java, Python, or R. This has some costly consequences, such as: 
>  * Learning a new programming language and development environment. 
>  * Integrating with existing .NET technologies through complex interop. 
>  * Migrating legacy code and library dependencies to a supported language. 
> Another alternative is that third-party languages should only interact with 
> Spark via pure SQL, possibly via REST. However, this does not enable UDFs or 
> UDAFs written in C#, a key desideratum in this effort, which most notably 
> takes the form of legacy code/UDFs that would need to be ported to a 
> supported language (e.g., Scala). This exercise is extremely cumbersome and not 
> always feasible because the code may no longer be available, i.e., only the 
> compiled library exists. As mentioned earlier, the .NET developer community 
> is one of the largest in the world, and as such there exist many instances of 
> legacy code (e.g., machine learning routines) that would be difficult to port 
> without the existing .NET library dependencies. 
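
For readers unfamiliar with the "two-process solution, IPC communication"
pattern named in the quoted design sketch, here is a minimal illustration in
Python. It is only a sketch under stated assumptions, not the proposal's
design and not Spark's actual wire protocol: the framing (4-byte big-endian
length prefix, negative length as end-of-stream), the use of stdin/stdout
pipes as the IPC channel, the upper-casing "UDF", and the names
WORKER_SOURCE, read_exact, and run_udf_in_worker are all hypothetical.

```python
import struct
import subprocess
import sys

# Hypothetical worker program: reads length-prefixed UTF-8 values from stdin,
# applies a stand-in "UDF" (upper-casing), and writes framed results to stdout.
WORKER_SOURCE = r"""
import struct, sys

def read_exact(stream, n):
    # Read exactly n bytes; an empty read means the pipe was closed.
    buf = b""
    while len(buf) < n:
        chunk = stream.read(n - len(buf))
        if not chunk:
            raise EOFError("pipe closed mid-frame")
        buf += chunk
    return buf

inp, out = sys.stdin.buffer, sys.stdout.buffer
while True:
    (length,) = struct.unpack(">i", read_exact(inp, 4))
    if length < 0:  # assumed convention: negative length ends the stream
        break
    value = read_exact(inp, length).decode("utf-8")
    result = value.upper().encode("utf-8")  # the stand-in UDF
    out.write(struct.pack(">i", len(result)) + result)
    out.flush()
"""


def read_exact(stream, n):
    """Read exactly n bytes from a binary stream or raise EOFError."""
    buf = b""
    while len(buf) < n:
        chunk = stream.read(n - len(buf))
        if not chunk:
            raise EOFError("worker closed the pipe mid-frame")
        buf += chunk
    return buf


def run_udf_in_worker(values):
    """Send each string to a separate worker process and collect its results."""
    with subprocess.Popen(
        [sys.executable, "-c", WORKER_SOURCE],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
    ) as proc:
        results = []
        for value in values:
            payload = value.encode("utf-8")
            # Frame: 4-byte big-endian length, then the payload bytes.
            proc.stdin.write(struct.pack(">i", len(payload)) + payload)
            proc.stdin.flush()
            (length,) = struct.unpack(">i", read_exact(proc.stdout, 4))
            results.append(read_exact(proc.stdout, length).decode("utf-8"))
        # Tell the worker to exit cleanly.
        proc.stdin.write(struct.pack(">i", -1))
        proc.stdin.flush()
    return results
```

Calling run_udf_in_worker(["spark", "dotnet"]) starts the second process,
round-trips each value through it, and returns the transformed strings. The
point of the separation is that the user function runs in its own runtime
while the calling side only handles framing and transport.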



