[jira] [Updated] (SPARK-20881) Clearly document the mechanism to choose between two sources of statistics

Zhenhua Wang (JIRA) Sun, 28 May 2017 10:09:02 -0700

     [ 
https://issues.apache.org/jira/browse/SPARK-20881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Zhenhua Wang updated SPARK-20881:
---------------------------------
    Summary: Clearly document the mechanism to choose between two sources of 
statistics  (was: Use Hive's stats in metastore when cbo is disabled)

> Clearly document the mechanism to choose between two sources of statistics
> --------------------------------------------------------------------------
>
>                 Key: SPARK-20881
>                 URL: https://issues.apache.org/jira/browse/SPARK-20881
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Zhenhua Wang
>
> Currently, statistics are generated by the "analyze command" in Spark. 
> However, when a user updates the table and collects stats in Hive, 
> "totalSize"/"numRows" will be updated in the metastore. 
> At that point, the table stats on the Spark side become stale. 
> If CBO is enabled, this is OK because we assume the user will handle this and 
> re-run the command to update the stats. 
> If CBO is disabled, then we should fall back to the original way and respect 
> Hive's stats. But in the current implementation, Spark's stats always override 
> Hive's stats, regardless of whether CBO is enabled or disabled.
> The right thing to do is to use (not override) Hive's stats when CBO is 
> disabled.
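
The selection rule proposed in the description can be sketched roughly as follows. This is only an illustration of the intended behaviour, not Spark's actual code: `TableStats` and `choose_stats` are hypothetical names, and falling back to the other source when the preferred one is missing is an assumption that the description does not state.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TableStats:
    """Hypothetical container for table-level statistics."""
    size_in_bytes: int
    row_count: Optional[int] = None


def choose_stats(
    cbo_enabled: bool,
    spark_stats: Optional[TableStats],
    hive_stats: Optional[TableStats],
) -> Optional[TableStats]:
    """Pick which source of statistics the planner should trust."""
    if cbo_enabled:
        # CBO on: trust the stats produced by Spark's analyze command; the
        # user is expected to re-run it after the table changes.
        # Assumption: use Hive's stats if Spark has none.
        return spark_stats if spark_stats is not None else hive_stats
    # CBO off: respect Hive's totalSize/numRows from the metastore instead
    # of letting Spark's (possibly stale) stats override them.
    # Assumption: use Spark's stats if Hive has none.
    return hive_stats if hive_stats is not None else spark_stats
```

With both sources present, `choose_stats(False, spark, hive)` returns the Hive-side stats and `choose_stats(True, spark, hive)` returns the Spark-side stats, which is the behaviour the description asks for.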



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
