Francesco Cavrini created SPARK-30218:
-----------------------------------------

             Summary: Columns used in inequality conditions for joins not 
resolved correctly in case of common lineage
                 Key: SPARK-30218
                 URL: https://issues.apache.org/jira/browse/SPARK-30218
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 2.4.1
            Reporter: Francesco Cavrini


When columns from different data-frames that have a common lineage are used in 
inequality conditions in joins, they are not resolved correctly. In particular, 
both the column from the left DF and the one from the right DF are resolved to 
the same column, thus making the inequality condition either always satisfied 
or always not-satisfied.

Minimal example to reproduce follows.

{code:python}
import pyspark.sql.functions as F

data = spark.createDataFrame([["id1", "A", 0], ["id1", "A", 1], ["id2", "A", 
2], ["id2", "A", 3], ["id1", "B", 1] , ["id1", "B", 5], ["id2", "B", 10]], 
["id", "kind", "timestamp"])

df_left = data.where(F.col("kind") == "A").alias("left")
df_right = data.where(F.col("kind") == "B").alias("right")

conds = [df_left["id"] == df_right["id"]]
conds.append(df_right["timestamp"].between(df_left["timestamp"], 
df_left["timestamp"] + 2))

res = df_left.join(df_right, conds, how="left")
{code}

The result is:

| id|kind|timestamp| id|kind|timestamp|
|id1|   A|        0|id1|   B|        1|
|id1|   A|        0|id1|   B|        5|
|id1|   A|        1|id1|   B|        1|
|id1|   A|        1|id1|   B|        5|
|id2|   A|        2|id2|   B|       10|
|id2|   A|        3|id2|   B|       10|

which violates the condition that the timestamp from the right DF should be 
between df_left["timestamp"] and  df_left["timestamp"] + 2.

The plan shows the problem in the column resolution.

{code:bash}
== Parsed Logical Plan ==
Join LeftOuter, ((id#0 = id#36) && ((timestamp#2L >= timestamp#2L) && 
(timestamp#2L <= (timestamp#2L + cast(2 as bigint)))))
:- SubqueryAlias `left`
:  +- Filter (kind#1 = A)
:     +- LogicalRDD [id#0, kind#1, timestamp#2L], false
+- SubqueryAlias `right`
   +- Filter (kind#37 = B)
      +- LogicalRDD [id#36, kind#37, timestamp#38L], false
{code}

Note, the columns used in the equality condition of the join have been 
correctly resolved.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to