This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.3 by this push:
     new 058dcbf3fb0 [SPARK-43240][SQL][3.3] Fix the wrong result issue when calling df.describe() method
058dcbf3fb0 is described below

commit 058dcbf3fb0b17a4295f6e0b516f5c955cfa2d59
Author: Jia Ke <ke.a....@intel.com>
AuthorDate: Wed Apr 26 17:24:46 2023 +0800

    [SPARK-43240][SQL][3.3] Fix the wrong result issue when calling df.describe() method
    
    ### What changes were proposed in this pull request?
    The df.describe() method caches the RDD. If the cached RDD is an RDD[UnsafeRow], the rows may be released (their backing buffers reused) after each row is consumed, so the result will be wrong. We need to copy the rows before caching, as the [TakeOrderedAndProjectExec](https://github.com/apache/spark/blob/d68d46c9e2cec04541e2457f4778117b570d8cdb/sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala#L204) operator does.
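    
    For illustration, a minimal sketch (not from the patch) of the unsafe vs. safe call shapes, assuming a hypothetical local session; `toRdd` returns an RDD[InternalRow] whose UnsafeRow instances may reuse their backing buffers:
    
    ```scala
    import org.apache.spark.sql.SparkSession
    
    val spark = SparkSession.builder()
      .appName("copy-before-collect-demo")
      .master("local[1]")
      .getOrCreate()
    val ds = spark.range(3).selectExpr("id", "id * 2 AS doubled")
    
    // Unsafe: collected rows may point at a reused buffer, so values
    // can be overwritten before they are read back.
    val unsafeHead = ds.queryExecution.toRdd.collect().head
    
    // Safe: copy each row out of the iterator first, as this patch does.
    val safeHead = ds.queryExecution.toRdd.map(_.copy()).collect().head
    ```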
    
    ### Why are the changes needed?
    bug fix
    
    ### Does this PR introduce _any_ user-facing change?
    no
    
    ### How was this patch tested?
    
    Closes #40914 from JkSelf/describe.
    
    Authored-by: Jia Ke <ke.a....@intel.com>
    Signed-off-by: Wenchen Fan <wenc...@databricks.com>
---
 .../main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala  | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala
index 9155c1cb6e7..ff6c08cea00 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala
@@ -288,7 +288,7 @@ object StatFunctions extends Logging {
     }
 
     // If there is no selected columns, we don't need to run this aggregate, so make it a lazy val.
-    lazy val aggResult = ds.select(aggExprs: _*).queryExecution.toRdd.collect().head
+    lazy val aggResult = ds.select(aggExprs: _*).queryExecution.toRdd.map(_.copy()).collect().head
 
     // We will have one row for each selected statistic in the result.
     val result = Array.fill[InternalRow](selectedStatistics.length) {

