[ https://issues.apache.org/jira/browse/SPARK-25925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16673842#comment-16673842 ]
Alex Ivanov edited comment on SPARK-25925 at 11/3/18 12:05 AM:
---------------------------------------------------------------

Thank you for the clarification, [~budde]. This all makes sense, and it seems like the lesser of two evils, i.e. correctness over performance.

Perhaps this can be addressed with a documentation change. Right now the only mention of *spark.sql.hive.caseSensitiveInferenceMode* in the [Spark SQL Programming Guide|https://spark.apache.org/docs/latest/sql-programming-guide.html] is in the [Upgrading From Spark SQL 2.1 to 2.2|https://spark.apache.org/docs/latest/sql-programming-guide.html#upgrading-from-spark-sql-21-to-22] section.

If this information were also provided in the [Hive metastore Parquet table conversion|https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-metastore-parquet-table-conversion] section, it would be much clearer to users that they should definitely consider setting this property to *NEVER_INFER* if they don't have mixed-case Parquet schemas.

Would you be OK with that change?

> Spark 2.3.1 retrieves all partitions from Hive Metastore by default
> -------------------------------------------------------------------
>
> Key: SPARK-25925
> URL: https://issues.apache.org/jira/browse/SPARK-25925
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.1
> Reporter: Alex Ivanov
> Priority: Major
>
> Spark 2.3.1 comes with the following _spark-defaults.conf_ parameters by
> default:
> {code:java}
> spark.sql.hive.convertMetastoreParquet true
> spark.sql.hive.metastorePartitionPruning true
> spark.sql.hive.caseSensitiveInferenceMode INFER_AND_SAVE{code}
> While the first two properties are fine, the last one has an unfortunate
> side-effect. I realize it's set to INFER_AND_SAVE for a reason, namely
> https://issues.apache.org/jira/browse/SPARK-19611, however that also causes
> an issue.
> The problem is at this point:
> [https://github.com/apache/spark/blob/a2f502cf53b6b00af7cb80b6f38e64cf46367595/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L232]
> The inference causes all partitions to be retrieved for the table from Hive
> Metastore. This is a problem because even running *explain* on a simple query
> on a table with thousands of partitions seems to hang, and is very difficult
> to debug.
> Moreover, many people will address the issue by changing:
> {code:java}
> spark.sql.hive.convertMetastoreParquet false{code}
> see that it works, and call it a day, thereby forgoing the benefits of using
> Parquet support in Spark directly. In our experience, this causes significant
> slow-downs on at least some queries.
> This Jira is mostly to document the issue, even if it cannot be addressed, so
> that people who inevitably run into this behavior can see the resolution,
> which is changing the parameter to *NEVER_INFER*, provided there are no
> issues with Parquet-Hive schema compatibility, i.e. all of the schema is in
> lower-case.
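
The resolution quoted above depends on one condition: every Parquet column name is already lower-case. As a rough sketch only, the check could look like the following. The helper names `recommended_inference_mode` and `conf_line` are hypothetical and not part of Spark, and the sketch assumes the caller has already collected the column names (including any nested field names) from the Parquet files.

```python
# Hypothetical helpers, not part of Spark: choose a value for
# spark.sql.hive.caseSensitiveInferenceMode from a table's Parquet column names.

PROPERTY = "spark.sql.hive.caseSensitiveInferenceMode"


def recommended_inference_mode(parquet_column_names):
    """Return NEVER_INFER only if no column name contains upper-case characters.

    Otherwise fall back to INFER_AND_SAVE, the default mentioned in the issue.
    """
    names = list(parquet_column_names)
    if all(name == name.lower() for name in names):
        return "NEVER_INFER"
    return "INFER_AND_SAVE"


def conf_line(parquet_column_names):
    """Format the chosen setting as a spark-defaults.conf style line."""
    mode = recommended_inference_mode(parquet_column_names)
    return f"{PROPERTY} {mode}"


if __name__ == "__main__":
    # All lower-case: skipping inference is safe under the stated assumption.
    print(conf_line(["event_id", "event_time"]))
    # Mixed case: keep inference so column lookups stay correct.
    print(conf_line(["eventId", "event_time"]))
```

The resulting line can go into _spark-defaults.conf_, or the same key and value can be passed to spark-submit with `--conf`. Whether *NEVER_INFER* is actually safe for a given table still has to be verified against the real Parquet files.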