[ 
https://issues.apache.org/jira/browse/SPARK-30394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liupengcheng updated SPARK-30394:
---------------------------------
    Description: 
Currently, if `spark.sql.statistics.fallBackToHdfs` is enabled, Spark will 
scan HDFS files to collect table stats in the `DetermineTableStats` rule. But 
this can be expensive and inaccurate (it only measures file size on disk, 
without accounting for the compression factor). Actually, we can skip this if 
the hive table can be converted to a datasource table (parquet etc.), and do 
better estimation in `HadoopFsRelation`.

Before SPARK-28573, the implementation would update the CatalogTableStatistics, 
which caused improper stats (for parquet, this size is much smaller than the 
real size in memory) to be used in joinSelection when the hive table could be 
converted to a datasource table.

In our production environment, a user's highly compressed parquet table caused 
OOMs when doing `broadcastHashJoin` due to these improper stats.
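The mis-estimate can be illustrated with a small sketch (plain Python, not Spark code; the 10 MB threshold matches Spark's default `spark.sql.autoBroadcastJoinThreshold`, while the 20x compression factor is a hypothetical value for a highly compressed table):

```python
# Illustrative sketch: why the raw on-disk parquet size understates the
# in-memory size and can trigger a bad broadcast-join decision.
# The threshold mirrors Spark's default spark.sql.autoBroadcastJoinThreshold;
# the compression factor is a hypothetical example, not a measured value.

AUTO_BROADCAST_JOIN_THRESHOLD = 10 * 1024 * 1024  # 10 MB, Spark's default


def chooses_broadcast_join(estimated_size_bytes: int) -> bool:
    """Simplified model of joinSelection: broadcast a table whose
    estimated size is at or below the threshold."""
    return estimated_size_bytes <= AUTO_BROADCAST_JOIN_THRESHOLD


# A highly compressed parquet table: 8 MB on disk, 20x compression.
on_disk_size = 8 * 1024 * 1024
compression_factor = 20
in_memory_size = on_disk_size * compression_factor  # 160 MB decompressed

# Using the raw HDFS file size (the fallBackToHdfs stats), the planner
# decides to broadcast the table...
assert chooses_broadcast_join(on_disk_size)

# ...but the true in-memory size is far above the threshold, so
# broadcasting it risks an executor OOM.
assert not chooses_broadcast_join(in_memory_size)
```

Estimating in `HadoopFsRelation` instead lets the file-based size be scaled by `spark.sql.sources.fileCompressionFactor`, giving a number closer to the in-memory footprint.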

  was:
Currently, if `spark.sql.statistics.fallBackToHdfs` is enabled, Spark will 
scan HDFS files to collect table stats in the `DetermineTableStats` rule. But 
this can be expensive in some cases; actually, we can skip this if the hive 
table can be converted to a datasource table (parquet etc.).

Before [SPARK-28573|https://issues.apache.org/jira/browse/SPARK-28573], the 
implementation would update the CatalogTableStatistics, which caused improper 
stats to be used in joinSelection when the hive table could be converted to a 
datasource table.

In our production environment, a user's highly compressed parquet table caused 
OOMs when doing `broadcastHashJoin` due to these improper stats.


> Skip collecting stats in DetermineTableStats rule when hive table is 
> convertible to datasource tables
> ------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-30394
>                 URL: https://issues.apache.org/jira/browse/SPARK-30394
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.2, 3.0.0
>            Reporter: liupengcheng
>            Priority: Major
>
> Currently, if `spark.sql.statistics.fallBackToHdfs` is enabled, Spark will 
> scan HDFS files to collect table stats in the `DetermineTableStats` rule. 
> But this can be expensive and inaccurate (it only measures file size on 
> disk, without accounting for the compression factor). Actually, we can skip 
> this if the hive table can be converted to a datasource table (parquet 
> etc.), and do better estimation in `HadoopFsRelation`.
> Before SPARK-28573, the implementation would update the 
> CatalogTableStatistics, which caused improper stats (for parquet, this size 
> is much smaller than the real size in memory) to be used in joinSelection 
> when the hive table could be converted to a datasource table.
> In our production environment, a user's highly compressed parquet table 
> caused OOMs when doing `broadcastHashJoin` due to these improper stats.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
