This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
     new 444053f6adc [SPARK-42655][SQL] Incorrect ambiguous column reference error
444053f6adc is described below

commit 444053f6adc05bfe22359ccc825926428359c4a8
Author: Shrikant Prasad <shrpr...@visa.com>
AuthorDate: Tue Apr 4 21:15:57 2023 +0800

    [SPARK-42655][SQL] Incorrect ambiguous column reference error
    
    **What changes were proposed in this pull request?**
    The result of attribute resolution should consider only unique values for the reference. If it contains duplicate values, it will incorrectly result in an ambiguous reference error.
    
    **Why are the changes needed?**
    The query below incorrectly fails with an ambiguous reference error.
    val df1 = sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", "col5")
    val op_cols_mixed_case = List("id","col2","col3","col4", "col5", "ID")
    val df3 = df1.select(op_cols_mixed_case.head, op_cols_mixed_case.tail: _*)
    df3.select("id").show()
    org.apache.spark.sql.AnalysisException: Reference 'id' is ambiguous, could be: id, id.
    
    df3.explain()
    == Physical Plan ==
    *(1) Project [_1#6 AS id#17, _2#7 AS col2#18, _3#8 AS col3#19, _4#9 AS col4#20, _5#10 AS col5#21, _1#6 AS ID#17]
    
    Before the fix, attributes matched were:
    attributes: Vector(id#17, id#17)
    Thus, it throws an ambiguous reference error. But if we consider only unique matches, it returns the correct result.
    unique attributes: Vector(id#17)
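
    The dedup idea above can be sketched in isolation. This is a hypothetical stand-in, not Spark's actual resolver code: the `Attribute` case class and `resolve` function here are simplified illustrations, with `exprId` mirroring the `#17` suffix in `id#17` from the plan above.

```scala
// Hypothetical stand-in for attribute resolution (not Spark's real resolver):
// shows why deduplicating matched candidates avoids the false ambiguity error.
case class Attribute(name: String, exprId: Long)

def resolve(candidates: Seq[Attribute]): Either[String, Attribute] =
  candidates.distinct match { // the fix: consider only unique matches
    case Seq(a) => Right(a)   // exactly one unique match: resolved
    case Seq()  => Left("not found")
    case many   => Left("Reference is ambiguous, could be: " +
                        many.map(_.name).mkString(", "))
  }

// Before the fix the resolver saw Vector(id#17, id#17) and failed;
// after distinct it sees Vector(id#17) and resolves successfully.
val matched = Seq(Attribute("id", 17), Attribute("id", 17))
println(resolve(matched)) // Right(Attribute(id,17))
```

    Since both matched attributes carry the same expression ID, `distinct` collapses them to a single candidate, which is exactly what `prunedCandidates.distinct` achieves in the patch below.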
    
    **Does this PR introduce any user-facing change?**
    Yes. Users migrating from Spark 2.3 to 3.x will hit this error, as the scenario worked in Spark 2.3 but fails in Spark 3.2. After the fix, it will work correctly as it did in Spark 2.3.
    
    **How was this patch tested?**
    Added unit test.
    
    Closes #40258 from shrprasa/col_ambiguous_issue.
    
    Authored-by: Shrikant Prasad <shrpr...@visa.com>
    Signed-off-by: Wenchen Fan <wenc...@databricks.com>
    (cherry picked from commit b283c6a0e47c3292dbf00400392b3a3f629dd965)
    Signed-off-by: Wenchen Fan <wenc...@databricks.com>
---
 .../org/apache/spark/sql/catalyst/expressions/package.scala      | 2 +-
 .../src/test/scala/org/apache/spark/sql/DataFrameSuite.scala     | 9 +++++++++
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala
index 74f0875c285..67936c36b41 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala
@@ -342,7 +342,7 @@ package object expressions  {
       // attribute metadata to indicate that they are from metadata columns, but they should not
       // keep any restrictions that may break column resolution for normal attributes.
       // See SPARK-42084 for more details.
-      prunedCandidates.map(_.markAsAllowAnyAccess()) match {
+      prunedCandidates.distinct.map(_.markAsAllowAnyAccess()) match {
         case Seq(a) if nestedFields.nonEmpty =>
           // One match, but we also need to extract the requested nested field.
           // The foldLeft adds ExtractValues for every remaining parts of the identifier,
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala
index bf8d7816e47..b43b8b1080c 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala
@@ -2756,6 +2756,15 @@ class DataFrameSuite extends QueryTest
     checkAnswer(swappedDf.filter($"key"($"map") > "a"), Row(2, Map(2 -> "b")))
   }
 
+  test("SPARK-42655 Fix ambiguous column reference error") {
+    val df1 = sparkContext.parallelize(List((1, 2, 3, 4, 5))).toDF("id", "col2", "col3",
+      "col4", "col5")
+    val op_cols_mixed_case = List("id", "col2", "col3", "col4", "col5", "ID")
+    val df2 = df1.select(op_cols_mixed_case.head, op_cols_mixed_case.tail: _*)
+    // should not throw any error.
+    checkAnswer(df2.select("id"), Row(1))
+  }
+
   test("SPARK-26057: attribute deduplication on already analyzed plans") {
     withTempView("a", "b", "v") {
       val df1 = Seq(("1-1", 6)).toDF("id", "n")


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
