[ https://issues.apache.org/jira/browse/SPARK-30218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17021346#comment-17021346 ]
Dongjoon Hyun commented on SPARK-30218: --------------------------------------- As [~cloud_fan] mentioned, SPARK-28344 throws `AnalysisException` for this query. So, I'm closing this as a duplicate of SPARK-28344. {code} >>> spark.version '3.0.0-preview2' pyspark.sql.utils.AnalysisException: Column timestamp#2L, timestamp#2L are ambiguous. It's probably because you joined several Datasets together, and some of these Datasets are the same. This column points to one of the Datasets but Spark is unable to figure out which one. Please alias the Datasets with different names via `Dataset.as` before joining them, and specify the column using qualified name, e.g. `df.as("a").join(df.as("b"), $"a.id" > $"b.id")`. You can also set spark.sql.analyzer.failAmbiguousSelfJoin.enabled to false to disable this check.; {code} > Columns used in inequality conditions for joins not resolved correctly in > case of common lineage > ------------------------------------------------------------------------------------------------ > > Key: SPARK-30218 > URL: https://issues.apache.org/jira/browse/SPARK-30218 > Project: Spark > Issue Type: Bug > Components: PySpark > Affects Versions: 2.3.4, 2.4.4 > Reporter: Francesco Cavrini > Priority: Major > Labels: correctness > > When columns from different data-frames that have a common lineage are used > in inequality conditions in joins, they are not resolved correctly. In > particular, both the column from the left DF and the one from the right DF > are resolved to the same column, thus making the inequality condition either > always satisfied or always not-satisfied. > Minimal example to reproduce follows. > {code:python} > import pyspark.sql.functions as F > data = spark.createDataFrame([["id1", "A", 0], ["id1", "A", 1], ["id2", "A", > 2], ["id2", "A", 3], ["id1", "B", 1] , ["id1", "B", 5], ["id2", "B", 10]], > ["id", "kind", "timestamp"]) > df_left = data.where(F.col("kind") == "A").alias("left") > df_right = data.where(F.col("kind") == "B").alias("right") > conds = [df_left["id"] == df_right["id"]] > conds.append(df_right["timestamp"].between(df_left["timestamp"], > df_left["timestamp"] + 2)) > res = df_left.join(df_right, conds, how="left") > {code} > The result is: > | id|kind|timestamp| id|kind|timestamp| > |id1| A| 0|id1| B| 1| > |id1| A| 0|id1| B| 5| > |id1| A| 1|id1| B| 1| > |id1| A| 1|id1| B| 5| > |id2| A| 2|id2| B| 10| > |id2| A| 3|id2| B| 10| > which violates the condition that the timestamp from the right DF should be > between df_left["timestamp"] and df_left["timestamp"] + 2. > The plan shows the problem in the column resolution. > {code:bash} > == Parsed Logical Plan == > Join LeftOuter, ((id#0 = id#36) && ((timestamp#2L >= timestamp#2L) && > (timestamp#2L <= (timestamp#2L + cast(2 as bigint))))) > :- SubqueryAlias `left` > : +- Filter (kind#1 = A) > : +- LogicalRDD [id#0, kind#1, timestamp#2L], false > +- SubqueryAlias `right` > +- Filter (kind#37 = B) > +- LogicalRDD [id#36, kind#37, timestamp#38L], false > {code} > Note, the columns used in the equality condition of the join have been > correctly resolved. -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org