[ 
https://issues.apache.org/jira/browse/SPARK-25925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16678861#comment-16678861
 ] 

Adam Budde commented on SPARK-25925:
------------------------------------

[~axenol] I would definitely support making the documentation clearer in this 
instance.

> Spark 2.3.1 retrieves all partitions from Hive Metastore by default
> -------------------------------------------------------------------
>
>                 Key: SPARK-25925
>                 URL: https://issues.apache.org/jira/browse/SPARK-25925
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.1
>            Reporter: Alex Ivanov
>            Priority: Major
>
> Spark 2.3.1 comes with the following _spark-defaults.conf_ parameters by 
> default:
> {code:java}
> spark.sql.hive.convertMetastoreParquet true
> spark.sql.hive.metastorePartitionPruning true
> spark.sql.hive.caseSensitiveInferenceMode INFER_AND_SAVE{code}
> While the first two properties are fine, the last one has an unfortunate 
> side-effect. I realize it's set to INFER_AND_SAVE for a reason, namely 
> https://issues.apache.org/jira/browse/SPARK-19611, however that also causes 
> an issue.
> The problem is at this point:
> [https://github.com/apache/spark/blob/a2f502cf53b6b00af7cb80b6f38e64cf46367595/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L232]
> The inference causes all partitions to be retrieved for the table from Hive 
> Metastore. This is a problem because even running *explain* on a simple query 
> on a table with thousands of partitions seems to hang, and is very difficult 
> to debug.
> Moreover, many people will address the issue by changing:
> {code:java}
> spark.sql.hive.convertMetastoreParquet false{code}
> see that it works, and call it a day, thereby forgoing the benefits of using 
> Parquet support in Spark directly. In our experience, this causes significant 
> slow-downs on at least some queries.
> This Jira is mostly to document the issue, even if it cannot be addressed, so 
> that people who inevitably run into this behavior can see the resolution, 
> which is changing the parameter to *NEVER_INFER*, provided there are no 
> issues with Parquet-Hive schema compatibility, i.e. all of the schema is in 
> lower-case.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to