[jira] [Resolved] (SPARK-39991) AQE should use available column statistics from completed query stages

Dongjoon Hyun (Jira) Mon, 19 Sep 2022 14:18:04 -0700


     [ https://issues.apache.org/jira/browse/SPARK-39991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Dongjoon Hyun resolved SPARK-39991.
-----------------------------------
    Fix Version/s: 3.4.0
       Resolution: Fixed

Issue resolved by pull request 37424
[https://github.com/apache/spark/pull/37424]

> AQE should use available column statistics from completed query stages
> ----------------------------------------------------------------------
>
>                 Key: SPARK-39991
>                 URL: https://issues.apache.org/jira/browse/SPARK-39991
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Andy Grove
>            Assignee: Andy Grove
>            Priority: Major
>             Fix For: 3.4.0
>
>
> In QueryStageExec.computeStats we copy partial statistics from materialized 
> query stages by calling QueryStageExec#getRuntimeStatistics, which in turn 
> calls ShuffleExchangeLike#runtimeStatistics or 
> BroadcastExchangeLike#runtimeStatistics.
> Only dataSize and numOutputRows are copied into the new Statistics object:
>  {code:scala}
>   def computeStats(): Option[Statistics] = if (isMaterialized) {
>     val runtimeStats = getRuntimeStatistics
>     val dataSize = runtimeStats.sizeInBytes.max(0)
>     val numOutputRows = runtimeStats.rowCount.map(_.max(0))
>     Some(Statistics(dataSize, numOutputRows, isRuntime = true))
>   } else {
>     None
>   }
> {code}
> I would like to also copy over the column statistics stored in 
> Statistics.attributeStats so that they can be fed back into the logical plan 
> optimization phase. This is a small change as shown below:
> {code:scala}
>   def computeStats(): Option[Statistics] = if (isMaterialized) {
>     val runtimeStats = getRuntimeStatistics
>     val dataSize = runtimeStats.sizeInBytes.max(0)
>     val numOutputRows = runtimeStats.rowCount.map(_.max(0))
>     val attributeStats = runtimeStats.attributeStats
>     Some(Statistics(dataSize, numOutputRows, attributeStats, isRuntime = true))
>   } else {
>     None
>   }
> {code}
> The Spark implementations of ShuffleExchangeLike and BroadcastExchangeLike do 
> not currently provide such column statistics, but other custom 
> implementations can.
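
To illustrate the last point, here is a minimal sketch of the proposed behavior that runs without Spark. Everything in it is a simplified stand-in and an assumption made for illustration: ColumnStat and Statistics only mirror the shape of the Spark classes (a plain Map keyed by column name replaces Spark's attribute-keyed map), and ExchangeLike, QueryStage and ColumnStatsDemo are hypothetical names, not Spark APIs.

```scala
// Simplified stand-ins for illustration only; these are NOT Spark's classes.
final case class ColumnStat(
    distinctCount: Option[BigInt] = None,
    nullCount: Option[BigInt] = None)

final case class Statistics(
    sizeInBytes: BigInt,
    rowCount: Option[BigInt] = None,
    // Assumption: keyed by column name here to keep the sketch self-contained.
    attributeStats: Map[String, ColumnStat] = Map.empty,
    isRuntime: Boolean = false)

// Hypothetical exchange abstraction that reports statistics after it has run.
trait ExchangeLike {
  def runtimeStatistics: Statistics
}

// Hypothetical query stage wrapping an exchange.
final case class QueryStage(exchange: ExchangeLike, isMaterialized: Boolean) {
  def computeStats(): Option[Statistics] =
    if (isMaterialized) {
      val runtimeStats = exchange.runtimeStatistics
      // Clamp negative values to zero, as in the snippets above.
      val dataSize = runtimeStats.sizeInBytes.max(0)
      val numOutputRows = runtimeStats.rowCount.map(_.max(0))
      // The proposed change: carry the column statistics along as well.
      val attributeStats = runtimeStats.attributeStats
      Some(Statistics(dataSize, numOutputRows, attributeStats, isRuntime = true))
    } else {
      None
    }
}

object ColumnStatsDemo {
  // A custom exchange that, unlike the built-in ones described above,
  // reports per-column statistics.
  val customExchange: ExchangeLike = new ExchangeLike {
    def runtimeStatistics: Statistics = Statistics(
      sizeInBytes = BigInt(4096),
      rowCount = Some(BigInt(100)),
      attributeStats = Map(
        "id" -> ColumnStat(distinctCount = Some(BigInt(100)), nullCount = Some(BigInt(0)))))
  }

  def main(args: Array[String]): Unit = {
    val stage = QueryStage(customExchange, isMaterialized = true)
    // Prints the runtime statistics, including the column statistics for "id".
    println(stage.computeStats())
  }
}
```

With an exchange that reports no column statistics, attributeStats is simply empty, so under these assumptions the result carries the same information as before the change.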



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

