Github user budde commented on the issue:

    https://github.com/apache/spark/pull/16797

> how about we add a new SQL command to refresh the table schema in metastore by inferring schema with data files? This is a compatibility issue and we should have provided a way for users to migrate, before the 2.1 release. I think this approach is much simpler than adding a flag.

While I think a command for inferring the table's case-sensitive schema and storing it as a property would be a welcome addition, requiring that property to be present before Spark SQL can work properly with case-sensitive data files could really restrict the settings Spark SQL can be used in.

If a user wanted to use Spark SQL to query an existing warehouse containing hundreds or even thousands of tables, then under the suggested approach a Spark job would have to be run to infer the schema of each and every table. While file formats such as Parquet store their schemas as metadata, there could still be millions of files to inspect across the warehouse. A less amenable format like JSON might require scanning all the data in the warehouse.

This also doesn't cover the use case @viirya pointed out, where the user may not have write access to the metastore they are querying against. In that case, the user would have to rely on the warehouse administrator to create the Spark schema property for every table they wish to query.

> For tables created by hive, as hive is a case-insensitive system, will the parquet files have mixed-case schema?

I think the Hive Metastore has become a bit of an open standard for maintaining a data warehouse catalog, since so many tools integrate with it. I wouldn't assume that the underlying data pointed to by an external metastore was created or managed by Hive itself. For example, we maintain a Hive Metastore that catalogs case-sensitive files written by our Spark-based ETL pipeline, which parses case classes from string data and writes them as Parquet.
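To make the mismatch concrete, here is a minimal sketch in plain Python (not Spark code, and not how Spark implements it) of reconciling the lowercased column names a metastore returns with the case-sensitive field names found in a data file. The helper name `reconcile_schema` and the sample column names are hypothetical and chosen only for illustration.

```python
def reconcile_schema(metastore_columns, file_fields):
    """Map lowercased metastore columns to the case-sensitive names in a data file.

    Hypothetical helper for illustration only; not Spark's implementation.
    """
    by_lower = {}
    for field in file_fields:
        key = field.lower()
        if key in by_lower:
            # Two fields that differ only by case cannot be matched unambiguously.
            raise ValueError(
                f"fields differ only by case: {by_lower[key]!r} and {field!r}"
            )
        by_lower[key] = field
    # Columns missing from the file (e.g. partition columns) keep the metastore spelling.
    return [by_lower.get(col.lower(), col) for col in metastore_columns]


# Assumed sample data: names as a case-insensitive metastore would store them,
# and names as a case-class-based job might have written them to Parquet.
metastore_columns = ["userid", "eventtime", "ds"]
file_fields = ["userId", "eventTime"]

resolved = reconcile_schema(metastore_columns, file_fields)
print(resolved)  # ['userId', 'eventTime', 'ds']
```

Doing this at query time needs only read access to the data files, whereas persisting the result as a table property needs write access to the metastore.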