Bruce Robbins created SPARK-52339:
-------------------------------------

             Summary: Relations may appear equal even though they are different
                 Key: SPARK-52339
                 URL: https://issues.apache.org/jira/browse/SPARK-52339
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 4.0.0, 3.5.6
            Reporter: Bruce Robbins


For example:
{noformat}
// create test data
val data = Seq((1, 2), (2, 3)).toDF("a", "b")
data.write.mode("overwrite").csv("/tmp/test")

val fileList1 = List.fill(2)("/tmp/test")
val fileList2 = List.fill(3)("/tmp/test")

val df1 = spark.read.schema("a int, b int").csv(fileList1: _*)
val df2 = spark.read.schema("a int, b int").csv(fileList2: _*)

df1.count() // correctly returns 4
df2.count() // correctly returns 6

// the following is the same as above, except df1 is persisted
val df1 = spark.read.schema("a int, b int").csv(fileList1: _*).persist
val df2 = spark.read.schema("a int, b int").csv(fileList2: _*)

df1.count() // correctly returns 4
df2.count() // incorrectly returns 4!!
{noformat}
df1 and df2 were created with a different number of paths: df1 has 2, and df2
has 3. But since the distinct set of root paths is the same (e.g.,
{{Set("/tmp/test") == Set("/tmp/test")}}), the two dataframes are considered
equal. Thus, when df1 is persisted, df2 uses df1's cached plan.
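
The collapse of duplicate paths can be seen with plain Scala collections, without Spark. The following is only a minimal sketch for the Scala REPL or spark-shell: the names paths1, paths2, and planCache are made up for illustration, and the Map is a toy stand-in for a cache lookup, not Spark's actual cache code.

```scala
// The same path lists used in the reproduction above.
val paths1 = List.fill(2)("/tmp/test")
val paths2 = List.fill(3)("/tmp/test")

// Comparing the distinct sets of paths: duplicates collapse, so the lists look equal.
val sameAsSets = paths1.toSet == paths2.toSet

// Comparing the lists themselves keeps the multiplicity, so they differ.
val sameAsSeqs = paths1 == paths2

// Toy cache keyed by the distinct path set (hypothetical, for illustration only):
// a lookup made with paths2 finds the entry that was stored for paths1.
val planCache = Map(paths1.toSet -> "plan for df1")
val hit = planCache.get(paths2.toSet)

println(s"sameAsSets=$sameAsSets sameAsSeqs=$sameAsSeqs hit=$hit")
```

Here sameAsSets is true while sameAsSeqs is false, and the lookup with paths2 returns the entry stored for paths1, which matches the symptom of df2 picking up df1's cached plan.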

The same bug also causes inappropriate exchange reuse.




