[ 
https://issues.apache.org/jira/browse/SPARK-25925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16673842#comment-16673842
 ] 

Alex Ivanov edited comment on SPARK-25925 at 11/3/18 12:05 AM:
---------------------------------------------------------------

Thank you for the clarification, [~budde]. This all makes sense, and seems like 
the lesser of the two evils, i.e. correctness over performance.

Perhaps this can be addressed with a documentation change. Right now, the only 
mention of *spark.sql.hive.caseSensitiveInferenceMode* in the [Spark SQL Programming 
Guide|https://spark.apache.org/docs/latest/sql-programming-guide.html] is in the 
[Upgrading From Spark SQL 2.1 to 
2.2|https://spark.apache.org/docs/latest/sql-programming-guide.html#upgrading-from-spark-sql-21-to-22]
 section. If this information were also provided in the [Hive metastore Parquet table 
conversion|https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-metastore-parquet-table-conversion]
 section, it would be much clearer to users that they should consider setting 
this property to *NEVER_INFER* if they don't have a mixed-case Parquet schema.
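
For illustration, a minimal sketch of how a user could check whether their 
Parquet files actually contain mixed-case column names before switching to 
*NEVER_INFER* (the path and session setup below are hypothetical):
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("check-parquet-column-case")
  .enableHiveSupport()
  .getOrCreate()

// Hypothetical path to the table's underlying Parquet files.
val parquetPath = "hdfs:///warehouse/my_db.db/my_table"

// Read the physical Parquet schema (case is preserved here, unlike in the
// lower-cased Hive metastore schema) and look for mixed-case column names.
val physicalSchema = spark.read.parquet(parquetPath).schema
val mixedCase = physicalSchema.fieldNames.filter(name => name != name.toLowerCase)

if (mixedCase.isEmpty) {
  println("All column names are lower-case; NEVER_INFER should be safe.")
} else {
  println(s"Mixed-case columns found: ${mixedCase.mkString(", ")}; keep inference on.")
}
{code}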

Would you be OK with that change?



> Spark 2.3.1 retrieves all partitions from Hive Metastore by default
> -------------------------------------------------------------------
>
>                 Key: SPARK-25925
>                 URL: https://issues.apache.org/jira/browse/SPARK-25925
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.1
>            Reporter: Alex Ivanov
>            Priority: Major
>
> Spark 2.3.1 comes with the following _spark-defaults.conf_ parameters by 
> default:
> {code:java}
> spark.sql.hive.convertMetastoreParquet true
> spark.sql.hive.metastorePartitionPruning true
> spark.sql.hive.caseSensitiveInferenceMode INFER_AND_SAVE{code}
> While the first two properties are fine, the last one has an unfortunate 
> side-effect. I realize it's set to INFER_AND_SAVE for a reason, namely 
> https://issues.apache.org/jira/browse/SPARK-19611; however, it also causes 
> an issue.
> The problem is at this point:
> [https://github.com/apache/spark/blob/a2f502cf53b6b00af7cb80b6f38e64cf46367595/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L232]
> The inference causes all of the table's partitions to be retrieved from the Hive 
> Metastore. This is a problem because even running *explain* on a simple query 
> against a table with thousands of partitions appears to hang, and the cause is 
> very difficult to debug.
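> For illustration, a minimal way to observe this (the database, table, and 
> partition column names are hypothetical):
> {code:scala}
> // Assumes a Hive-managed Parquet table with many partitions, e.g. my_db.events
> // partitioned by dt. With INFER_AND_SAVE, even planning the query triggers a
> // fetch of all partition metadata from the Hive Metastore.
> spark.sql("EXPLAIN SELECT count(*) FROM my_db.events WHERE dt = '2018-11-01'")
>   .show(truncate = false)
> {code}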
> Moreover, many people will address the issue by setting:
> {code:java}
> spark.sql.hive.convertMetastoreParquet false{code}
> seeing that it works, and calling it a day, thereby forgoing the benefits of 
> Spark's native Parquet support. In our experience, this causes significant 
> slow-downs on at least some queries.
> This Jira is mostly to document the issue, even if it cannot be addressed, so 
> that people who inevitably run into this behavior can see the resolution, 
> which is changing the parameter to *NEVER_INFER*, provided there are no 
> Parquet-Hive schema compatibility issues, i.e. all column names are 
> lower-case.
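> For example, one way to apply that resolution is via _spark-defaults.conf_ or 
> the SparkSession builder (a sketch; the application name is hypothetical, 
> adjust to your deployment):
> {code:scala}
> // Equivalent to adding the following line to spark-defaults.conf:
> //   spark.sql.hive.caseSensitiveInferenceMode NEVER_INFER
> val spark = org.apache.spark.sql.SparkSession.builder()
>   .appName("never-infer-example")
>   .config("spark.sql.hive.caseSensitiveInferenceMode", "NEVER_INFER")
>   .enableHiveSupport()
>   .getOrCreate()
> {code}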



