[ https://issues.apache.org/jira/browse/SPARK-32530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17371543#comment-17371543 ]

Jolan Rensen commented on SPARK-32530:
--------------------------------------

I've been helping with JetBrains' [Kotlin Spark 
API|https://github.com/JetBrains/kotlin-spark-api] for the last couple of 
months. The API is actually helping me a lot with the data analysis for my 
thesis. Personally, I'm not very familiar with Scala, and using Java with Spark 
introduces way too much boilerplate code, in my honest opinion. Kotlin brings a 
straightforward and readable flavor to Spark, which I really like. What I like 
most is the support for Kotlin's data classes, since I mostly use typed 
Datasets, and in Java, having to create an entire class with two constructors, 
getters, setters, and hashCode/equals functions takes too much time.
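For comparison, here is a minimal sketch of that pattern with plain Spark from 
Kotlin (a hypothetical example: it relies on Encoders.bean, so the data class 
needs var properties and default values, a restriction the Kotlin Spark API 
avoids by generating its own encoders for data classes):

{code:kotlin}
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.SparkSession

// One line replaces a Java bean with constructors, getters/setters and
// equals/hashCode. The var properties plus default values give the no-arg
// constructor and accessors that Encoders.bean expects.
data class Person(var name: String = "", var age: Int = 0)

fun main() {
    val spark = SparkSession.builder()
        .appName("kotlin-data-class-example")  // hypothetical app name
        .master("local[*]")
        .getOrCreate()

    val people = spark.createDataset(
        listOf(Person("Alice", 23), Person("Bob", 42)),
        Encoders.bean(Person::class.java)
    )
    people.show()

    spark.stop()
}
{code}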

While most of the Kotlin wrapper functions in the API can be created relatively 
easily using extension functions, a couple of functions prove difficult to use 
from Kotlin because of how the Apache Spark API is built. These are functions, 
for instance in the Dataset class, for which both a Java- and a Scala-specific 
variant exist.
 Kotlin has [SAM 
conversions|https://kotlinlang.org/docs/java-interop.html#sam-conversions], 
where an interface with a single abstract function, like ReduceFunction but 
also scala.Function2, can be instantiated from a lambda. What we would thus 
expect to be able to write is "myDataset.reduce { a, b -> a + b }". However, 
both the Scala and the Java variant of the reduce function match this call (and 
any other call that hits a similar pair of overloads), so compilation fails 
with an overload resolution ambiguity.
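For illustration, the workaround we currently need is to spell out the SAM 
interface so that only the Java-specific overload applies (a minimal sketch):

{code:kotlin}
import org.apache.spark.api.java.function.ReduceFunction
import org.apache.spark.sql.Dataset

fun sum(ds: Dataset<Int>): Int {
    // ds.reduce { a, b -> a + b }
    // ^ does not compile: the bare lambda SAM-converts to both
    //   scala.Function2 (Scala overload) and ReduceFunction (Java overload).

    // Naming the SAM interface explicitly picks the Java-specific overload:
    return ds.reduce(ReduceFunction<Int> { a, b -> a + b })
}
{code}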
 This problem also cannot be solved with an extension function in the Kotlin 
Spark API, since member functions of a class always take priority over 
extension functions. The only solution currently available on our side is to 
rename the extension function to something like "reduceK", but that is of 
course suboptimal.
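Roughly, such a renamed extension looks like this (a sketch; the actual API may 
declare it differently, e.g. with reified type parameters for encoder support):

{code:kotlin}
import org.apache.spark.api.java.function.ReduceFunction
import org.apache.spark.sql.Dataset

// Forwards a Kotlin lambda to the Java-specific overload. It has to live under
// a different name, because a member function named "reduce" always wins over
// an extension function of the same name during overload resolution.
fun <T> Dataset<T>.reduceK(func: (T, T) -> T): T =
    reduce(ReduceFunction<T> { a, b -> func(a, b) })

// Usage: myDataset.reduceK { a, b -> a + b }
{code}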
 A possible solution on your side could be to hide one of the two functions 
from Kotlin using a [Kotlin-specific Deprecated 
annotation|https://kotlinlang.org/api/latest/jvm/stdlib/kotlin/-deprecated/] at 
the HIDDEN level. This would resolve the overload resolution ambiguity. Of 
course, I understand if you don't want to pull the Kotlin standard library into 
Spark just for this, but I'm sure there are more solutions along these lines. 
It might even be possible to solve this neatly on our side, so let us know if 
there's something we haven't tried yet!
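To make the idea concrete, here is a small, entirely hypothetical illustration 
of DeprecationLevel.HIDDEN (not Spark code): the hidden overload stays in the 
bytecode for binary compatibility, but the Kotlin compiler no longer sees it, 
so a bare lambda resolves to the remaining overload.

{code:kotlin}
// Two SAM overloads that would both match a bare lambda, mimicking Spark's
// Java- and Scala-specific reduce variants (hypothetical interfaces).
fun interface JavaStyleReducer<T> { fun call(a: T, b: T): T }
fun interface ScalaStyleReducer<T> { fun apply(a: T, b: T): T }

class TinyDataset<T>(private val items: List<T>) {

    // Kept on the JVM for Java/Scala callers, but hidden from the Kotlin
    // compiler, so it no longer takes part in overload resolution.
    @Deprecated("Use the lambda-friendly overload from Kotlin", level = DeprecationLevel.HIDDEN)
    fun reduce(func: ScalaStyleReducer<T>): T = items.reduce { a, b -> func.apply(a, b) }

    fun reduce(func: JavaStyleReducer<T>): T = items.reduce { a, b -> func.call(a, b) }
}

fun main() {
    // Without the HIDDEN deprecation both overloads would match this lambda;
    // with it, the call resolves unambiguously.
    println(TinyDataset(listOf(1, 2, 3)).reduce { a, b -> a + b })  // prints 6
}
{code}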

If Kotlin support is properly added to Spark, this issue should also be taken 
care of :)

> SPIP: Kotlin support for Apache Spark
> -------------------------------------
>
>                 Key: SPARK-32530
>                 URL: https://issues.apache.org/jira/browse/SPARK-32530
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.0.1
>            Reporter: Pasha Finkeshteyn
>            Priority: Major
>
> h2. Background and motivation
> Kotlin is a cross-platform, statically typed, general-purpose JVM language. 
> In the last year more than 5 million developers have used Kotlin in mobile, 
> backend, frontend and scientific development. The number of Kotlin developers 
> grows rapidly every year. 
>  * [According to 
> redmonk|https://redmonk.com/sogrady/2020/02/28/language-rankings-1-20/]: 
> "Kotlin, the second fastest growing language we’ve seen outside of Swift, 
> made a big splash a year ago at this time when it vaulted eight full spots up 
> the list."
>  * [According to snyk.io|https://snyk.io/wp-content/uploads/jvm_2020.pdf], 
> Kotlin is the second most popular language on the JVM
>  * [According to 
> StackOverflow|https://insights.stackoverflow.com/survey/2020] Kotlin’s share 
> increased by 7.8% in 2020.
> We notice the increasing usage of Kotlin in data analysis ([6% of users in 
> 2020|https://www.jetbrains.com/lp/devecosystem-2020/kotlin/], as opposed to 
> 2% in 2019) and machine learning (3% of users in 2020, as opposed to 0% in 
> 2019), and we expect these numbers to continue to grow. 
> We, authors of this SPIP, strongly believe that making Kotlin API officially 
> available to developers can bring new users to Apache Spark and help some of 
> the existing users.
> h2. Goals
> The goal of this project is to bring first-class support for Kotlin language 
> into the Apache Spark project. We’re going to achieve this by adding one more 
> module to the current Apache Spark distribution.
> h2. Non-goals
> There is no goal to replace any existing language support or to change any 
> existing Apache Spark API.
> At this time, there is no goal to support non-core APIs of Apache Spark like 
> Spark ML and Spark structured streaming. This may change in the future based 
> on community feedback.
> There is no goal to provide CLI for Kotlin for Apache Spark, this will be a 
> separate SPIP.
> There is no goal to provide support for Apache Spark < 3.0.0.
> h2. Current implementation
> A working prototype is available at 
> [https://github.com/JetBrains/kotlin-spark-api]. It has been tested inside 
> JetBrains and by early adopters.
> h2. What are the risks?
> There is always a risk that this product won’t get enough popularity and will 
> bring more costs than benefits. It can be mitigated by the fact that we don't 
> need to change any existing API and support can be potentially dropped at any 
> time.
> We also believe that existing API is rather low maintenance. It does not 
> bring anything more complex than already exists in the Spark codebase. 
> Furthermore, the implementation is compact - less than 2000 lines of code.
> We are committed to maintaining, improving and evolving the API based on 
> feedback from both Spark and Kotlin communities. As the Kotlin data community 
> continues to grow, we see Kotlin API for Apache Spark as an important part in 
> the evolving Kotlin ecosystem, and intend to fully support it. 
> h2. How long will it take?
> A working implementation is already available, and if the community proposes 
> changes to improve it, these can be implemented quickly, in weeks if not days.



