[ 
https://issues.apache.org/jira/browse/SPARK-20881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhenhua Wang updated SPARK-20881:
---------------------------------
    Description: 
Currently, statistics in Spark are generated by the "analyze command". 

However, when a user updates the table and collects stats in Hive, 
"totalSize"/"numRows" are updated in the metastore. 

At that point, the table stats on the Spark side become stale. 
If CBO is enabled, this is acceptable, because we assume the user will handle 
it and re-run the command to update the stats. 
If CBO is disabled, we should fall back to the original behavior and respect 
Hive's stats. But in the current implementation, Spark's stats always override 
Hive's stats, regardless of whether CBO is enabled or disabled.

The right thing to do is to use (not override) Hive's stats when CBO is 
disabled.
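
The selection rule proposed above can be sketched roughly as follows. This is only an illustration in plain Python, not Spark's actual code: the function `choose_table_stats`, the helper `_read`, the `spark.sql.statistics.` property prefix, and the fallback to the other source when the preferred one is missing are all assumptions made for the example. Only the Hive keys "totalSize" and "numRows" come from the description.

```python
from typing import Mapping, Optional

# Hive's keys are the ones named in the description.
HIVE_TOTAL_SIZE = "totalSize"
HIVE_NUM_ROWS = "numRows"
# Assumed prefix under which Spark keeps its own stats (illustrative only).
SPARK_PREFIX = "spark.sql.statistics."


def _read(props: Mapping[str, str], size_key: str, rows_key: str) -> Optional[dict]:
    """Return the stats stored under the given keys, or None if no size is recorded."""
    if size_key not in props:
        return None
    rows = props.get(rows_key)
    return {
        "size_in_bytes": int(props[size_key]),
        "row_count": int(rows) if rows is not None else None,
    }


def choose_table_stats(props: Mapping[str, str], cbo_enabled: bool) -> Optional[dict]:
    """Pick which stats to use for a table, given its metastore properties."""
    hive = _read(props, HIVE_TOTAL_SIZE, HIVE_NUM_ROWS)
    spark = _read(props, SPARK_PREFIX + "totalSize", SPARK_PREFIX + "numRows")
    if cbo_enabled:
        # CBO on: keep Spark's stats; the user is expected to re-run the
        # analyze command when they go stale. Falling back to Hive's stats
        # when Spark has none is an assumption of this sketch.
        return spark if spark is not None else hive
    # CBO off: respect Hive's stats instead of overriding them.
    return hive if hive is not None else spark
```

With both sets of properties present, `choose_table_stats(props, cbo_enabled=False)` returns the values Hive wrote, while `cbo_enabled=True` returns the values recorded by Spark.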

  was:
Spark's statistics are generated by the "analyze command". 

However, when a user updates the table and collects stats in Hive, 
"totalSize"/"numRows" are updated in the metastore. 

At that point, the table stats on the Spark side are stale even if we turn off 
CBO, because in the current implementation, Spark's stats always override 
Hive's stats, regardless of whether CBO is enabled or disabled.

The right thing to do is to use Hive's stats in this case.


> Use Hive's stats in metastore when cbo is disabled
> --------------------------------------------------
>
>                 Key: SPARK-20881
>                 URL: https://issues.apache.org/jira/browse/SPARK-20881
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Zhenhua Wang


