Terry Kim created SPARK-27006:
---------------------------------

             Summary: SPIP: .NET bindings for Apache Spark
                 Key: SPARK-27006
                 URL: https://issues.apache.org/jira/browse/SPARK-27006
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 3.0.0
            Reporter: Terry Kim
             Fix For: 3.0.0


h4. Background and Motivation: 

Apache Spark provides programming language support for Scala/Java (native), and 
extensions for Python and R. While a variety of other language extensions could 
be included in Apache Spark, .NET would bring one of the largest developer 
communities to the table. Presently, no good open source Big Data solution 
exists for .NET developers. This SPIP aims to discuss how we can bring Apache 
Spark goodness to the .NET development platform.

.NET is a free, cross-platform, open source developer platform for building 
many different types of applications. With .NET, you can use multiple 
languages, editors, and libraries to build web, mobile, desktop, gaming, and 
IoT applications. Even though .NET serves millions of developers, no good Big 
Data solution exists for it today; this SPIP aims to address that gap.

The .NET developer community is one of the largest programming language 
communities in the world. Its flagship programming language C# is listed as one 
of the most popular programming languages in a variety of articles and 
statistics: 
 * Most popular technologies on Stack Overflow: 
[https://insights.stackoverflow.com/survey/2018/#most-popular-technologies|https://insights.stackoverflow.com/survey/2018/]
 * Most popular languages on GitHub 2018: 
[https://www.businessinsider.com/the-10-most-popular-programming-languages-according-to-github-2018-10#2-java-9|https://www.businessinsider.com/the-10-most-popular-programming-languages-according-to-github-2018-10]
 * 1M+ new developers in the last year 
 * Second most in-demand technology on LinkedIn 
 * Top 30 high-velocity OSS projects on GitHub 

Including a C# language extension in Apache Spark will enable millions of .NET 
developers to author Big Data applications with their preferred programming 
language, development environment, and tooling. We aim to promote the .NET 
bindings for Spark through engagements with the Spark community (e.g., we are 
scheduled to present an early prototype at the SF Spark Summit 2019) and the 
.NET developer community (e.g., similar presentations will be held at .NET 
developer conferences this year). As such, we believe that our efforts will 
help grow the Spark community by making Spark accessible to the millions of 
.NET developers. 

Furthermore, our early discussions with some large .NET development teams were 
met with an enthusiastic reception. 

We recognize that earlier attempts at this goal (specifically Mobius, 
[https://github.com/Microsoft/Mobius]) were unsuccessful primarily due to the 
lack of communication with the Spark community. Therefore, another goal of this 
proposal is not only to develop .NET bindings for Spark in open source, but 
also to continuously seek feedback from the Spark community via Jira issues 
(like this one) and the Spark developer mailing list. Our hope is that through 
these engagements, we can build a community of developers who are eager to 
contribute to this effort or want to leverage the resulting .NET bindings for 
Spark in their respective Big Data applications. 
h4. Target Personas: 

.NET developers looking to build big data solutions.  
h4. Goals: 

Our primary goal is to help grow Apache Spark by making it accessible to the 
large .NET developer base and ecosystem. We will also look for opportunities to 
generalize the interop layers for Spark so that other language extensions can 
be added in the future. 
[SPARK-26257|https://issues.apache.org/jira/browse/SPARK-26257] proposes such a 
generalized interop layer, which we hope to address over the course of this 
project. 

Another important goal for us is not only to enable Spark as an application 
solution for .NET developers, but also to open the door for .NET developers to 
make contributions to Apache Spark itself. 

Lastly, we aim to develop a .NET extension in the open, while continually 
engaging with the Spark community for feedback on designs and code. We will 
welcome PRs from the Spark community throughout this project and aim to grow a 
community of developers who want to contribute to it. 
h4. Non-Goals: 

This proposal is focused on adding .NET bindings to Apache Spark, and leaves 
any performance-related tasks for future work. Further, we aim to provide 
support only at the DataFrame level. 
h4. Proposed API Changes: 

This work mostly involves introducing new .NET binding APIs. For example, we 
would introduce .NET UDF-related classes such as DotnetUDF and 
UserDefinedDotnetFunction, along with classes responsible for running .NET UDFs 
such as DotnetRunner and DotnetWorkerFactory. 

This work should have minimal impact on existing Spark APIs. However, in order 
to provide a clean solution, we may need to introduce .NET-specific hooks in 
the Dataset API, for example for collecting data in the driver program. 

We will also introduce a new physical operator for the DotnetUDF expression, 
together with Catalyst rules that plan this operator wherever the expression 
appears in the logical plan. 

On the C# side, similar to existing language extensions, we will introduce 
proxy artifacts that mimic SparkSession, DataFrame, and other APIs related to 
Spark SQL, such as Column and the functions native to Spark SQL. 
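
To make the idea of a proxy artifact concrete, here is a minimal sketch in Python (chosen only for brevity; it is not the proposed C# API). It assumes each proxy holds nothing but an id for an object living on the JVM side and forwards every call to a backend. RecordingBackend, SparkSessionProxy, DataFrameProxy, and the invoke method are hypothetical names introduced for illustration; the backend here only records calls instead of talking to a real JVM.

```python
class RecordingBackend:
    """Hypothetical stand-in for the JVM side: logs calls and returns new ids."""

    def __init__(self):
        self.calls = []      # every forwarded call, in order
        self._next_id = 0    # last id handed out for a "remote" object

    def invoke(self, target_id, method, *args):
        # A real backend would serialize this call and send it to the JVM.
        self.calls.append((target_id, method, args))
        self._next_id += 1
        return self._next_id


class DataFrameProxy:
    """Mimics a small part of the DataFrame surface; holds only a remote id."""

    def __init__(self, backend, object_id):
        self._backend = backend
        self.object_id = object_id

    def select(self, *columns):
        new_id = self._backend.invoke(self.object_id, "select", *columns)
        return DataFrameProxy(self._backend, new_id)

    def filter(self, condition):
        new_id = self._backend.invoke(self.object_id, "filter", condition)
        return DataFrameProxy(self._backend, new_id)


class SparkSessionProxy:
    """Mimics the session entry point; also just a handle to a remote object."""

    def __init__(self, backend):
        self._backend = backend
        self.object_id = backend.invoke(None, "getOrCreate")

    def table(self, name):
        new_id = self._backend.invoke(self.object_id, "table", name)
        return DataFrameProxy(self._backend, new_id)
```

With this shape, a chain such as SparkSessionProxy(backend).table("people").filter("age > 21") never touches data on the client side; it only builds up a sequence of forwarded calls, which is the property the proxy approach relies on.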

We will also look into augmenting the existing spark-submit and spark-shell 
scripts with the ability to recognize a .NET environment.  
h4. Optional Design Sketch: 

Our design will largely follow the design of Python Spark support, including 
how worker orchestration is performed (i.e., a two-process solution that relies 
on inter-process communication). As such, we will introduce “Runners” specific 
to executing the .NET driver and UDF workers. 
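
As a rough illustration of the two-process pattern only (not the proposed implementation), the sketch below uses Python to stand in for both sides: a "runner" launches a separate worker process and exchanges length-prefixed frames with it over the worker's stdin/stdout pipes. The function name run_udf_in_worker, the 4-byte big-endian framing, and the upper-casing "UDF" are all assumptions made for this example; the actual wire protocol and transport are not specified in this proposal.

```python
import struct
import subprocess
import sys

# Source of the hypothetical worker process. It reads length-prefixed UTF-8
# frames from stdin, applies a toy "UDF" (upper-casing), and writes
# length-prefixed replies to stdout. It stops when stdin reaches EOF.
WORKER_SOURCE = r'''
import struct, sys

def read_frame(stream):
    header = stream.read(4)
    if len(header) < 4:
        return None                      # EOF: the runner closed the pipe
    (size,) = struct.unpack(">I", header)
    return stream.read(size)

def write_frame(stream, payload):
    stream.write(struct.pack(">I", len(payload)) + payload)
    stream.flush()

inp, out = sys.stdin.buffer, sys.stdout.buffer
while True:
    frame = read_frame(inp)
    if frame is None:
        break
    write_frame(out, frame.decode("utf-8").upper().encode("utf-8"))
'''


def run_udf_in_worker(rows):
    """Send each row to a separate worker process and collect the replies."""
    results = []
    # Popen used as a context manager closes the pipes and waits on exit;
    # closing stdin is what tells the worker to shut down.
    with subprocess.Popen(
        [sys.executable, "-c", WORKER_SOURCE],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
    ) as worker:
        for row in rows:
            payload = row.encode("utf-8")
            worker.stdin.write(struct.pack(">I", len(payload)) + payload)
            worker.stdin.flush()
            # Read the 4-byte length header, then exactly that many bytes.
            (size,) = struct.unpack(">I", worker.stdout.read(4))
            results.append(worker.stdout.read(size).decode("utf-8"))
    return results
```

The point of the split is that user code runs in its own runtime (here a second Python interpreter, in the proposal a .NET process) while the runner only moves serialized rows back and forth. This sketch sends one row per round trip for clarity; a real worker would presumably batch rows.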
h4. Optional Rejected Designs: 

The clear alternative is the status quo: developers who want to leverage 
Apache Spark do so through one of the existing supported languages, i.e., 
Scala/Java, Python, or R. This has some costly consequences, such as: 
 * Learning a new programming language and development environment. 
 * Integrating with existing .NET technologies through complex interop. 
 * Migrating legacy code and library dependencies to a supported language. 

Another alternative is that third-party languages should only interact with 
Spark via pure SQL, possibly via REST. However, this does not enable UDFs or 
UDAFs written in C#, which is a key requirement of this effort. Without it, 
legacy code and UDFs would need to be ported to a supported language, e.g., 
Scala. This exercise is extremely cumbersome and not always feasible, because 
the source code may no longer be available (i.e., only the compiled library 
exists). As mentioned earlier, the .NET developer community is one of the 
largest in the world, and as such there exist many instances of legacy code 
(e.g., machine learning routines) that would be difficult to port without the 
existing .NET library dependencies. 


