[ 
https://issues.apache.org/jira/browse/SPARK-21083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhenhua Wang updated SPARK-21083:
---------------------------------
    Description: 
1. If we analyze a table without `noscan` and then analyze it again with 
`noscan`, and the table has not changed in between, we should keep the row 
count in the statistics.
2. If we analyze one column of a table and then analyze another column, and 
the table has not changed in between, we should keep the previous column 
stats and combine them with the newly collected column stats.

  was:
Suppose we have already collected column stats for some columns, and we then 
collect column stats for other columns:
* If the table changed between the two collection runs, we need to remove 
the stale column stats and keep only the latest ones.
* Otherwise, combine the two sets of column stats.

Note that we always update sizeInBytes/rowCount when collecting column stats, 
so that logic does not need to change.


> Consider staleness when collecting column stats
> -----------------------------------------------
>
>                 Key: SPARK-21083
>                 URL: https://issues.apache.org/jira/browse/SPARK-21083
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Zhenhua Wang
>
> 1. If we analyze a table without `noscan` and then analyze it again with 
> `noscan`, and the table has not changed in between, we should keep the row 
> count in the statistics.
> 2. If we analyze one column of a table and then analyze another column, and 
> the table has not changed in between, we should keep the previous column 
> stats and combine them with the newly collected column stats.
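
The two rules above can be illustrated with a small, self-contained sketch. 
This is not Spark's implementation: `TableStats` and `merge_stats` are 
hypothetical names, and treating an unchanged sizeInBytes (plus an unchanged 
row count, when one is collected) as evidence that the table did not change 
is an assumption made only for this illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class TableStats:
    """Hypothetical, simplified model of a table's stored statistics."""
    size_in_bytes: int
    row_count: Optional[int] = None
    col_stats: Dict[str, dict] = field(default_factory=dict)


def merge_stats(old, new_size, new_row_count=None, new_col_stats=None):
    """Combine previously stored stats with the result of a new analyze run.

    new_row_count is None for a noscan-style run that only recomputes size.
    Assumption: equal size (and equal row count, when available) means the
    table did not change between the two runs.
    """
    fresh_cols = dict(new_col_stats or {})
    if old is None:
        return TableStats(new_size, new_row_count, fresh_cols)

    size_same = old.size_in_bytes == new_size
    if new_row_count is None:
        # Rule 1: a noscan-style run keeps the old row count if nothing changed.
        row_count = old.row_count if size_same else None
        unchanged = size_same
    else:
        row_count = new_row_count
        unchanged = size_same and old.row_count == new_row_count

    # Rule 2: keep old column stats only when the table looks unchanged;
    # newly collected stats win for any column analyzed again.
    cols = {**old.col_stats, **fresh_cols} if unchanged else fresh_cols
    return TableStats(new_size, row_count, cols)
```

In this sketch, stale column stats are dropped as soon as the size or row 
count differs, which errs on the side of discarding stats rather than 
keeping possibly outdated ones.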


