[ https://issues.apache.org/jira/browse/SPARK-35652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dongjoon Hyun updated SPARK-35652:
----------------------------------
    Fix Version/s:     (was: 3.0.3)
                       3.0.4

> Different Behaviour join vs joinWith in self joining
> ----------------------------------------------------
>
>                 Key: SPARK-35652
>                 URL: https://issues.apache.org/jira/browse/SPARK-35652
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.1.2
>         Environment: Spark 3.1.2
>                      Scala 2.12
>            Reporter: Wassim Almaaoui
>            Assignee: dgd_contributor
>            Priority: Critical
>             Fix For: 3.2.0, 3.1.3, 3.0.4
>
>
> It seems that Spark performs a cartesian join when self-joining with `joinWith`, but an inner join when self-joining with `join`.
> Snippet:
>
> {code:java}
> scala> val df = spark.range(0,5)
> df: org.apache.spark.sql.Dataset[Long] = [id: bigint]
>
> scala> df.show
> +---+
> | id|
> +---+
> |  0|
> |  1|
> |  2|
> |  3|
> |  4|
> +---+
>
> scala> df.join(df, df("id") === df("id")).count
> 21/06/04 16:01:39 WARN Column: Constructing trivially true equals predicate, 'id#1649L = id#1649L'. Perhaps you need to use aliases.
> res21: Long = 5
>
> scala> df.joinWith(df, df("id") === df("id")).count
> 21/06/04 16:01:47 WARN Column: Constructing trivially true equals predicate, 'id#1649L = id#1649L'. Perhaps you need to use aliases.
> res22: Long = 25
> {code}
> According to the comment in the source code, joinWith is expected to handle this case, right?
> {code:java}
> def joinWith[U](other: Dataset[U], condition: Column, joinType: String): Dataset[(T, U)] = {
>   // Creates a Join node and resolve it first, to get join condition resolved, self-join resolved,
>   // etc.
> {code}
> I find it odd that join and joinWith don't have the same behaviour.
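Editor's note, not part of the original report: the WARN message above hints at the usual workaround ("Perhaps you need to use aliases"). A minimal Scala sketch of that workaround follows; it assumes a running SparkSession named `spark`, and the names `left`, `right`, `l`, and `r` are illustrative only.

{code:scala}
// Sketch of the aliasing workaround suggested by the warning message.
// With explicit aliases the join condition is no longer trivially true,
// so both join and joinWith resolve the self-join against distinct sides.
import spark.implicits._

val df = spark.range(0, 5)

val left  = df.as("l")
val right = df.as("r")

// Both counts are expected to be 5, matching df.join(df, ...) in the report.
left.join(right, $"l.id" === $"r.id").count()       // expected: 5
left.joinWith(right, $"l.id" === $"r.id").count()   // expected: 5
{code}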