[ 
https://issues.apache.org/jira/browse/SPARK-35652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-35652:
----------------------------------
    Fix Version/s:     (was: 3.0.3)
                   3.0.4

> Different Behaviour join vs joinWith in self joining
> ----------------------------------------------------
>
>                 Key: SPARK-35652
>                 URL: https://issues.apache.org/jira/browse/SPARK-35652
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.1.2
>         Environment: {color:#172b4d}Spark 3.1.2{color}
> Scala 2.12
>            Reporter: Wassim Almaaoui
>            Assignee: dgd_contributor
>            Priority: Critical
>             Fix For: 3.2.0, 3.1.3, 3.0.4
>
>
> It seems that Spark performs a Cartesian join in a self join when using `joinWith`, 
> but a proper inner join when using `join`.
> Snippet:
>  
> {code:java}
> scala> val df = spark.range(0,5)
> df: org.apache.spark.sql.Dataset[Long] = [id: bigint]
> scala> df.show
> +---+
> | id|
> +---+
> |  0|
> |  1|
> |  2|
> |  3|
> |  4|
> +---+
> scala> df.join(df, df("id") === df("id")).count
> 21/06/04 16:01:39 WARN Column: Constructing trivially true equals predicate, 'id#1649L = id#1649L'. Perhaps you need to use aliases.
> res21: Long = 5
> scala> df.joinWith(df, df("id") === df("id")).count
> 21/06/04 16:01:47 WARN Column: Constructing trivially true equals predicate, 'id#1649L = id#1649L'. Perhaps you need to use aliases.
> res22: Long = 25
> {code}
> According to the comment in the source code, joinWith is expected to handle this 
> case, right?
> {code:java}
> def joinWith[U](other: Dataset[U], condition: Column, joinType: String): Dataset[(T, U)] = {
>   // Creates a Join node and resolve it first, to get join condition resolved, self-join resolved,
>   // etc.
> {code}
> I find it odd that join and joinWith don't have the same behaviour.
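> The warning message hints at using aliases; a minimal sketch of that workaround follows 
> (the alias names and the expected count of 5 are assumptions on my side, not verified output):
> {code:java}
> import org.apache.spark.sql.functions.col
>
> // Alias both sides so the join condition refers to two distinct relations
> // ("a" and "b" are arbitrary alias names).
> val left = df.as("a")
> val right = df.as("b")
> left.joinWith(right, col("a.id") === col("b.id")).count  // expected: 5
> {code}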



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
