[ https://issues.apache.org/jira/browse/SPARK-45170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17780627#comment-17780627 ]
PEIYUAN SUN commented on SPARK-45170:
-------------------------------------

What is the difference between this and [Frameless|https://typelevel.org/frameless/FeatureOverview.html]?

> Scala-specific improvements in Dataset[T] API
> ----------------------------------------------
>
>                 Key: SPARK-45170
>                 URL: https://issues.apache.org/jira/browse/SPARK-45170
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.4.1
>            Reporter: Danila Goloshchapov
>            Priority: Minor
>              Labels: SPIP
>
> *Q1.* What are you trying to do?
> The main idea is to use the power of Scala macros to give developers a more convenient and typesafe API for join conditions.
>
> *Q2.* What problem is this proposal NOT designed to solve?
> The R/Java/Python/DataFrame APIs are out of scope. The solution does not affect plan generation either.
>
> *Q3.* How is it done today, and what are the limits of current practice?
> Currently the join condition is specified via strings, which can lead to silly mistakes (typos, incompatible column types, etc.) and is sometimes hard to read (e.g. when several joins are chained and the final type is a tuple of tuples of tuples...).
>
> *Q4.* What is new in your approach and why do you think it will be successful?
> Scala macros can be used to extract the column name directly from a lambda (an extractor). As a side effect, it is possible to check the column type and refuse to build an inconsistent join expression (such as a boolean-timestamp comparison).
>
> *Q5.* Who cares? If you are successful, what difference will it make?
> Mainly Scala developers who prefer typesafe code: they would get a cleaner, nicer API that makes the codebase a bit clearer, especially when several chained joins are used.
>
> *Q6.* What are the risks?
> Overuse of macros may slow down compilation. In addition, macros are hard to maintain.
>
> *Q7.* How long will it take?
> The approach is already implemented as a separate [lib|https://github.com/Salamahin/joinwiz] that does a bit more than just provide an alternative API (for example, it abstracts Dataset[T] to F[T], which allows running some Spark-specific code without a Spark session for testing purposes). Adapting it would not be a hard job: a matter of several weeks.
>
> *Q8.* What are the mid-term and final "exams" to check for success?
> API convenience is very hard to estimate, as it is more or less a question of taste.
>
> *Appendix A*
> You may find examples of such a 'cleaner' API [here|https://github.com/Salamahin/joinwiz/blob/master/joinwiz_core/src/test/scala/joinwiz/ComputationEngineTest.scala].
> Note that backward and forward compatibility is achieved by introducing a brand-new API without modifying the old one.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
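As an illustration of the typesafe-join idea described in Q3/Q4 of the quoted SPIP, here is a minimal, Spark-free sketch. All names in it ({{TypedCond}}, {{=:=}}, {{innerJoin}}, {{TypesafeJoinSketch}}) are hypothetical stand-ins, not the actual joinwiz API and not a committed Spark API; a {{Seq}} plays the role of {{Dataset[T]}}. It shows the compile-time half of the proposal only: the real design would additionally use a macro to recover the column *name* from the lambda, which plain functions cannot do.

```scala
// Hypothetical sketch of a typesafe join condition built from lambda
// extractors, in the spirit of the SPIP. Not the joinwiz or Spark API.
object TypesafeJoinSketch {

  case class User(id: Long, name: String)
  case class Order(userId: Long, amount: BigDecimal)

  // A join condition that remembers the element types of both sides.
  final case class TypedCond[L, R](matches: (L, R) => Boolean)

  implicit class ExtractorOps[L, T](private val left: L => T) {
    // Compiles only when both lambdas extract the same type T, so e.g. a
    // boolean-vs-timestamp comparison is rejected at compile time.
    def =:=[R](right: R => T): TypedCond[L, R] =
      TypedCond((l, r) => left(l) == right(r))
  }

  // A toy inner join over Seq, standing in for Dataset[T].
  def innerJoin[L, R](ls: Seq[L], rs: Seq[R])(cond: TypedCond[L, R]): Seq[(L, R)] =
    for { l <- ls; r <- rs if cond.matches(l, r) } yield (l, r)
}
```

Used like this, the condition reads as plain field access rather than a string column name, so a typo or a type mismatch fails compilation instead of the job:

```scala
import TypesafeJoinSketch._

val users  = Seq(User(1L, "ann"), User(2L, "bob"))
val orders = Seq(Order(1L, BigDecimal(10)), Order(3L, BigDecimal(5)))

val joined = innerJoin(users, orders)(((_: User).id) =:= ((_: Order).userId))
```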