This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch branch-3.0 in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/branch-3.0 by this push: new 8ab7cd3 [SPARK-37779][SQL] Make ColumnarToRowExec plan canonicalizable after (de)serialization 8ab7cd3 is described below commit 8ab7cd3ca7e828e114218ae811a9afebb5c9bcc7 Author: Hyukjin Kwon <gurwls...@apache.org> AuthorDate: Thu Dec 30 12:38:37 2021 +0900 [SPARK-37779][SQL] Make ColumnarToRowExec plan canonicalizable after (de)serialization This PR proposes to add a driver-side check on `supportsColumnar` sanity check at `ColumnarToRowExec`. SPARK-23731 fixed the plans to be serializable by leveraging lazy but SPARK-28213 happened to refer to the lazy variable at: https://github.com/apache/spark/blob/77b164aac9764049a4820064421ef82ec0bc14fb/sql/core/src/main/scala/org/apache/spark/sql/execution/Columnar.scala#L68 This can fail during canonicalization when, for example, eliminating common sub-expressions (on the executor side): ``` java.lang.NullPointerException at org.apache.spark.sql.execution.FileSourceScanExec.supportsColumnar$lzycompute(DataSourceScanExec.scala:280) at org.apache.spark.sql.execution.FileSourceScanExec.supportsColumnar(DataSourceScanExec.scala:279) at org.apache.spark.sql.execution.InputAdapter.supportsColumnar(WholeStageCodegenExec.scala:509) at org.apache.spark.sql.execution.ColumnarToRowExec.<init>(Columnar.scala:67) ... at org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:581) at org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:580) at org.apache.spark.sql.execution.ScalarSubquery.canonicalized$lzycompute(subquery.scala:110) ... at org.apache.spark.sql.catalyst.expressions.ExpressionEquals.hashCode(EquivalentExpressions.scala:275) ... 
at scala.collection.mutable.HashTable.findEntry$(HashTable.scala:135) at scala.collection.mutable.HashMap.findEntry(HashMap.scala:44) at scala.collection.mutable.HashMap.get(HashMap.scala:74) at org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.addExpr(EquivalentExpressions.scala:46) at org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.addExprTreeHelper$1(EquivalentExpressions.scala:147) at org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.addExprTree(EquivalentExpressions.scala:170) at org.apache.spark.sql.catalyst.expressions.SubExprEvaluationRuntime.$anonfun$proxyExpressions$1(SubExprEvaluationRuntime.scala:89) at org.apache.spark.sql.catalyst.expressions.SubExprEvaluationRuntime.$anonfun$proxyExpressions$1$adapted(SubExprEvaluationRuntime.scala:89) at scala.collection.immutable.List.foreach(List.scala:392) ``` This fix is still a bandaid fix but at least addresses the issue with a minimized change - this fix should ideally be backported too. This scenario is pretty unlikely - it requires `ColumnarToRowExec` to be canonicalized on the executor side (see the stacktrace) - but yes, it would fix a bug. A unit test was added. Closes #35058 from HyukjinKwon/SPARK-37779. 
Authored-by: Hyukjin Kwon <gurwls...@apache.org> Signed-off-by: Hyukjin Kwon <gurwls...@apache.org> (cherry picked from commit 195f1aaf4361fb8f5f31ef7f5c63464767ad88bd) Signed-off-by: Hyukjin Kwon <gurwls...@apache.org> --- .../org/apache/spark/sql/execution/Columnar.scala | 3 ++- .../apache/spark/sql/execution/SparkPlanSuite.scala | 19 +++++++++++++++++++ 2 files changed, 21 insertions(+), 1 deletion(-) diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/Columnar.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/Columnar.scala index e01cd85..2fe046b 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/Columnar.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/Columnar.scala @@ -58,7 +58,8 @@ class ColumnarRule { * [[MapPartitionsInRWithArrowExec]]. Eventually this should replace those implementations. */ case class ColumnarToRowExec(child: SparkPlan) extends UnaryExecNode with CodegenSupport { - assert(child.supportsColumnar) + // supportsColumnar requires to be only called on driver side, see also SPARK-37779. 
+ assert(TaskContext.get != null || child.supportsColumnar) override def output: Seq[Attribute] = child.output diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/SparkPlanSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/SparkPlanSuite.scala index 56fff11..dacf8fe 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/SparkPlanSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/SparkPlanSuite.scala @@ -88,4 +88,23 @@ class SparkPlanSuite extends QueryTest with SharedSparkSession { test("SPARK-30780 empty LocalTableScan should use RDD without partitions") { assert(LocalTableScanExec(Nil, Nil).execute().getNumPartitions == 0) } + + test("SPARK-37779: ColumnarToRowExec should be canonicalizable after being (de)serialized") { + withSQLConf(SQLConf.USE_V1_SOURCE_LIST.key -> "parquet") { + withTempPath { path => + spark.range(1).write.parquet(path.getAbsolutePath) + val df = spark.read.parquet(path.getAbsolutePath) + val columnarToRowExec = + df.queryExecution.executedPlan.collectFirst { case p: ColumnarToRowExec => p }.get + try { + spark.range(1).foreach { _ => + columnarToRowExec.canonicalized + () + } + } catch { + case e: Throwable => fail("ColumnarToRowExec was not canonicalizable", e) + } + } + } + } } --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org