[ 
https://issues.apache.org/jira/browse/SPARK-30218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30218:
----------------------------------
    Affects Version/s: 2.4.2
                       2.4.3
                       2.4.4

> Columns used in inequality conditions for joins not resolved correctly in 
> case of common lineage
> ------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-30218
>                 URL: https://issues.apache.org/jira/browse/SPARK-30218
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.4.1, 2.4.2, 2.4.3, 2.4.4
>            Reporter: Francesco Cavrini
>            Priority: Major
>              Labels: correctness
>
> When columns from different data-frames that have a common lineage are used 
> in inequality conditions in joins, they are not resolved correctly. In 
> particular, both the column from the left DF and the one from the right DF 
> are resolved to the same column, thus making the inequality condition either 
> always satisfied or always not-satisfied.
> Minimal example to reproduce follows.
> {code:python}
> import pyspark.sql.functions as F
> data = spark.createDataFrame([["id1", "A", 0], ["id1", "A", 1], ["id2", "A", 
> 2], ["id2", "A", 3], ["id1", "B", 1] , ["id1", "B", 5], ["id2", "B", 10]], 
> ["id", "kind", "timestamp"])
> df_left = data.where(F.col("kind") == "A").alias("left")
> df_right = data.where(F.col("kind") == "B").alias("right")
> conds = [df_left["id"] == df_right["id"]]
> conds.append(df_right["timestamp"].between(df_left["timestamp"], 
> df_left["timestamp"] + 2))
> res = df_left.join(df_right, conds, how="left")
> {code}
> The result is:
> | id|kind|timestamp| id|kind|timestamp|
> |id1|   A|        0|id1|   B|        1|
> |id1|   A|        0|id1|   B|        5|
> |id1|   A|        1|id1|   B|        1|
> |id1|   A|        1|id1|   B|        5|
> |id2|   A|        2|id2|   B|       10|
> |id2|   A|        3|id2|   B|       10|
> which violates the condition that the timestamp from the right DF should be 
> between df_left["timestamp"] and  df_left["timestamp"] + 2.
> The plan shows the problem in the column resolution.
> {code:bash}
> == Parsed Logical Plan ==
> Join LeftOuter, ((id#0 = id#36) && ((timestamp#2L >= timestamp#2L) && 
> (timestamp#2L <= (timestamp#2L + cast(2 as bigint)))))
> :- SubqueryAlias `left`
> :  +- Filter (kind#1 = A)
> :     +- LogicalRDD [id#0, kind#1, timestamp#2L], false
> +- SubqueryAlias `right`
>    +- Filter (kind#37 = B)
>       +- LogicalRDD [id#36, kind#37, timestamp#38L], false
> {code}
> Note, the columns used in the equality condition of the join have been 
> correctly resolved.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to