[ https://issues.apache.org/jira/browse/SPARK-27006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16779595#comment-16779595 ]
Sean Owen commented on SPARK-27006:
-----------------------------------

It'd be much, much better if this can, at least, start outside Spark. I don't think we can keep a whole other language and copy of the APIs and build up to date in the Spark project at this point.

> SPIP: .NET bindings for Apache Spark
> ------------------------------------
>
>                 Key: SPARK-27006
>                 URL: https://issues.apache.org/jira/browse/SPARK-27006
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.0.0
>            Reporter: Terry Kim
>            Priority: Minor
>             Fix For: 3.0.0
>
>   Original Estimate: 4,032h
>  Remaining Estimate: 4,032h
>
> h4. Background and Motivation:
> Apache Spark provides programming language support for Scala/Java (native), and extensions for Python and R. While a variety of other language extensions could be included in Apache Spark, .NET would bring one of the largest developer communities to the table. Presently, no good Big Data solution exists for .NET developers in open source. This SPIP aims at discussing how we can bring Apache Spark goodness to the .NET development platform.
>
> .NET is a free, cross-platform, open source developer platform for building many different types of applications. With .NET, you can use multiple languages, editors, and libraries to build for web, mobile, desktop, gaming, and IoT types of applications. Even with .NET serving millions of developers, no good Big Data solution exists for it today, a gap this SPIP aims to address.
>
> The .NET developer community is one of the largest programming language communities in the world.
> Its flagship programming language, C#, is listed as one of the most popular programming languages in a variety of articles and statistics:
> * Most popular technologies on Stack Overflow: [https://insights.stackoverflow.com/survey/2018/#most-popular-technologies|https://insights.stackoverflow.com/survey/2018/]
> * Most popular languages on GitHub 2018: [https://www.businessinsider.com/the-10-most-popular-programming-languages-according-to-github-2018-10#2-java-9|https://www.businessinsider.com/the-10-most-popular-programming-languages-according-to-github-2018-10]
> * 1M+ new developers in the last year
> * Second most demanded technology on LinkedIn
> * Top 30 high-velocity OSS projects on GitHub
>
> Including a C# language extension in Apache Spark will enable millions of .NET developers to author Big Data applications in their preferred programming language, developer environment, and tooling. We aim to promote the .NET bindings for Spark through engagements with the Spark community (e.g., we are scheduled to present an early prototype at the SF Spark Summit 2019) and the .NET developer community (e.g., similar presentations will be held at .NET developer conferences this year). As such, we believe that our efforts will help grow the Spark community by making it accessible to the millions of .NET developers.
>
> Furthermore, our early discussions with some large .NET development teams got an enthusiastic reception.
>
> We recognize that earlier attempts at this goal (specifically Mobius [https://github.com/Microsoft/Mobius]) were unsuccessful primarily due to the lack of communication with the Spark community. Therefore, another goal of this proposal is not only to develop .NET bindings for Spark in open source, but also to continuously seek feedback from the Spark community via posted Jira issues (like this one) and the Spark developer mailing list.
> Our hope is that through these engagements, we can build a community of developers who are eager to contribute to this effort or want to leverage the resulting .NET bindings for Spark in their respective Big Data applications.
>
> h4. Target Personas:
> .NET developers looking to build big data solutions.
>
> h4. Goals:
> Our primary goal is to help grow Apache Spark by making it accessible to the large .NET developer base and ecosystem. We will also look for opportunities to generalize the interop layers for Spark for adding other language extensions in the future. [SPARK-26257](https://issues.apache.org/jira/browse/SPARK-26257) proposes such a generalized interop layer, which we hope to address over the course of this project.
>
> Another important goal for us is not only to enable Spark as an application solution for .NET developers, but also to open the door for .NET developers to make contributions to Apache Spark itself.
>
> Lastly, we aim to develop a .NET extension in the open, while continually engaging with the Spark community for feedback on designs and code. We will welcome PRs from the Spark community throughout this project and aim to grow a community of developers who want to contribute to this project.
>
> h4. Non-Goals:
> This proposal is focused on adding .NET bindings to Apache Spark, and leaves any performance-related tasks for future work. Further, we aim to provide support only at the DataFrame level.
>
> h4. Proposed API Changes:
> This work mostly involves introducing new .NET binding APIs. For example, we would introduce .NET UDF related classes such as DotnetUDF, UserDefinedDotnetFunction, etc., along with classes responsible for running .NET UDFs such as DotnetRunner, DotnetWorkerFactory, etc.
>
> This work should have minimal impact on existing Spark APIs.
> However, in order to provide a clean solution, we foresee the possibility of introducing .NET-specific hooks in the Dataset API, for example for collecting data in the driver program.
>
> We will also introduce Catalyst rules that plan the physical operator (which we will introduce) for the DotnetUDF expression in the logical plan.
>
> On the C# side, similar to existing language extensions, we will introduce proxy artifacts that mimic the SparkSession, DataFrame, and other APIs related to Spark SQL, e.g., column, functions native to Spark SQL, etc.
>
> We will also look into augmenting the existing spark-submit and spark-shell scripts with the ability to recognize a .NET environment.
>
> h4. Optional Design Sketch:
> Our design will largely follow the design of Python Spark support, including how worker orchestration is performed (i.e., two-process solution, IPC communication). As such, we will introduce “Runners” specific to executing the Dotnet driver and UDF workers.
>
> h4. Optional Rejected Designs:
> The clear alternative is the status quo: developers who want to leverage Apache Spark do so through one of the existing supported languages, i.e., Scala/Java, Python, or R. This has some costly consequences, such as:
> * Learning a new programming language and development environment.
> * Integrating with existing .NET technologies through complex interop.
> * Migrating legacy code and library dependencies to a supported language.
>
> Another alternative is that third-party languages should only interact with Spark via pure SQL, possibly via REST. However, this does not enable UDFs or UDAFs written in C#, a key desideratum in this effort, which most notably takes the form of legacy code/UDFs that would need to be ported to a supported language, e.g., Scala. This exercise is extremely cumbersome and not always feasible, because the source code may no longer be available, i.e., only the compiled library exists.
> As mentioned earlier, the .NET developer community is one of the largest in the world, and as such there exist many instances of legacy code (e.g., machine learning routines) that would be difficult to port without the existing .NET library dependencies.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)