Danila Goloshchapov created SPARK-45170:
-------------------------------------------

             Summary: Scala-specific improvements in Dataset[T] API 
                 Key: SPARK-45170
                 URL: https://issues.apache.org/jira/browse/SPARK-45170
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 3.4.1
            Reporter: Danila Goloshchapov


*Q1.* What are you trying to do? 

The main idea is to use the power of Scala's macros to give developers a more 
convenient and type-safe API to use in join conditions. 

 

*Q2.* What problem is this proposal NOT designed to solve?

The R/Java/Python/DataFrame APIs are out of scope. The solution does not 
affect plan generation either. 

 

*Q3.* How is it done today, and what are the limits of current practice?

Currently the join condition is specified via strings, which can lead to 
silly mistakes (typos, incompatible column types, etc.) and is sometimes hard 
to read (for example, when several joins are chained and the final type is a 
tuple of tuples of tuples...).
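
A minimal sketch of today's practice (the Order/Customer case classes and 
dataset names are invented purely for illustration): the condition is built 
from string column names, so a typo or a type mismatch only surfaces at runtime.

{code:scala}
import org.apache.spark.sql.SparkSession

case class Order(customerId: Long, amount: Double)
case class Customer(id: Long, name: String)

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val orders    = spark.createDataset(Seq(Order(1L, 9.99)))
val customers = spark.createDataset(Seq(Customer(1L, "Alice")))

// Column names are plain strings: a typo like "customrId", or comparing
// incompatible column types, compiles fine and fails only at runtime.
val joined = orders.joinWith(customers, orders("customerId") === customers("id"))
// joined: Dataset[(Order, Customer)] -- after a second join this becomes
// Dataset[((Order, Customer), X)], the "tuple of tuples" mentioned above.
{code}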

 

*Q4.* What is new in your approach and why do you think it will be successful?

Scala macros can be used to extract the column name directly from a lambda 
(an extractor). As a side effect, it is possible to check the column type and 
prohibit building inconsistent join expressions (such as a boolean-timestamp 
comparison).
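
As a rough sketch of the technique (ColName and ofImpl are illustrative names, 
not part of the proposal), a Scala 2 def macro can pattern-match the lambda's 
tree and turn a field accessor into its column name at compile time:

{code:scala}
import scala.language.experimental.macros
import scala.reflect.macros.blackbox

object ColName {
  // ColName.of((o: Order) => o.customerId) expands to the literal
  // "customerId"; anything that is not a plain field accessor does not compile.
  def of[O, T](f: O => T): String = macro ofImpl[O, T]

  def ofImpl[O, T](c: blackbox.Context)(f: c.Tree): c.Tree = {
    import c.universe._
    f match {
      case Function(_, Select(_, field)) =>
        Literal(Constant(field.decodedName.toString))
      case other =>
        c.abort(c.enclosingPosition, s"Expected a field accessor like _.field, got: $other")
    }
  }
}
{code}

Because the field type T is tracked as well, a typed join builder could require 
the same T on both sides of the condition, which is how something like a 
boolean-timestamp comparison would be rejected at compile time.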

 

*Q5.* Who cares? If you are successful, what difference will it make?

Mainly Scala developers who prefer type-safe code: they would get a cleaner, 
nicer API that makes the codebase a bit clearer, especially when several 
chained joins are used.

 

*Q6.* What are the risks?

Overuse of macros may slow down compilation. In addition, macros are hard to 
maintain.

 

*Q7.* How long will it take?

The approach is already implemented as a separate 
[lib|https://github.com/Salamahin/joinwiz] that does a bit more than just 
provide an alternative API (for example, it abstracts Dataset[T] to F[T], 
which allows running some Spark-specific code without a Spark session for 
testing purposes).
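
As a sketch of that F[T] idea (illustrative names, not the library's actual 
API): the join logic is written against an abstract F[_], with one instance 
backed by Dataset and another by plain Seq for session-free tests.

{code:scala}
// Typeclass over the container: production code instantiates it with
// Dataset, tests with Seq, so no SparkSession is needed to test the logic.
trait JoinOps[F[_]] {
  def innerJoin[L, R](left: F[L], right: F[R])(on: (L, R) => Boolean): F[(L, R)]
}

object SeqJoinOps extends JoinOps[Seq] {
  def innerJoin[L, R](left: Seq[L], right: Seq[R])(on: (L, R) => Boolean): Seq[(L, R)] =
    for { l <- left; r <- right if on(l, r) } yield (l, r)
}

// A Dataset-backed instance would translate a typed condition into a Column
// expression rather than a plain function; omitted here for brevity.
{code}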

Adapting it won't be a hard job; a matter of several weeks.

 

*Q8.* What are the mid-term and final “exams” to check for success?

API convenience is very hard to estimate, as it is more or less a question of 
taste.

 

*Appendix A*

You may find examples of such a 'cleaner' API 
[here|https://github.com/Salamahin/joinwiz/blob/master/joinwiz_core/src/test/scala/joinwiz/ComputationEngineTest.scala].

Note that backward and forward compatibility is achieved by introducing a 
brand-new API without modifying the old one.
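
For instance (a sketch only; innerJoinTyped is an invented name), the new API 
can live in an implicit extension class behind its own import, leaving Dataset 
itself untouched:

{code:scala}
import org.apache.spark.sql.{Column, Dataset}

object typedjoins {
  // Opt-in syntax: existing code that never imports typedjoins._ is
  // completely unaffected, which keeps the old API backward compatible.
  implicit class TypedJoinSyntax[L](private val left: Dataset[L]) extends AnyVal {
    // A real implementation would accept typed column extractors; a plain
    // Column is used here only to keep the sketch short.
    def innerJoinTyped[R](right: Dataset[R])(condition: Column): Dataset[(L, R)] =
      left.joinWith(right, condition, "inner")
  }
}
{code}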

 


