Github user budde commented on the issue: https://github.com/apache/spark/pull/16797

> For better user experience, we should automatically infer the schema and write it back to metastore, if there is no case-sensitive table schema in metastore. This has the cost of detection the need of schema inference, and complicating table read code path.

Totally agree. I think the default behavior should be to infer and backfill a case-sensitive schema into the table properties if one isn't already there. An option should also be provided to disable all inference and just fall back to the case-insensitive metastore schema if none is found (i.e. the current behavior in 2.1.0).

> If this is only a compatibility issue, I think it's fair to ask the cluster maintainers to run some commands after upgrade Spark cluster. Even there are a lot of tables, it's easy to write a script to automate it.

I don't think this is fair. For one, as I've mentioned, in some cases Spark may not be the tool being used to maintain the metastore. This would require the warehouse admins to set up a Spark cluster and run these migration commands on every table with case-sensitive underlying data if they'd like those tables to be accessible from Spark. Second, while writing an automation script may be trivial, the execution costs aren't, especially if the data is stored in a format like JSON where every record in the table must be read in order to infer the schema.

> If there is no Spark specific table properties, we assume this table is created by hive(not by external systems like Presto), so the schema of parquet files should be all lowercased.

This isn't an assumption made by Spark prior to 2.1.0, whether that was an explicit decision or not. All I'm asking for is a way to configure Spark to continue supporting a use case it has supported for years. Also, in our case, the tables were created by Spark, not Presto. Presto is just an example of another execution engine we've put in front of our warehouse that hasn't had a problem with the underlying Parquet data being case-sensitive; we simply used an older version of Spark to create the tables. I would think long and hard about whether requiring warehouse admins to run potentially costly migrations between Spark versions to update table metadata is preferable to offering a way to remain backwards-compatible with the old behavior. Again, I think introducing a mechanism to migrate the table properties is a good idea. I just don't think it should be the only option.

> Another proposal is to make parquet reader case-insensitive, so that we can solve this problem without schema inference. But the problem is, Spark can be configured to be case-sensitive, so that it's possible to write such a schema (conflicting columns after lower-casing) into metastore. I think this proposal is the best if we can totally make Spark case-insensitive.

I don't think this would be a bad option if it could be enabled at the Parquet level, but it seems their work toward enabling case-insensitive file access has stalled. As @ericl pointed out above, moving this to the ParquetReadSupport level may make the situation better for Parquet, but the behavior won't be consistent across file formats like ORC or JSON.
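
To make the first point concrete, here is a rough sketch of what the opt-out could look like from a user's perspective, assuming it is exposed as an ordinary SQL conf. The property name and the mode values below are illustrative placeholders, not necessarily what this PR ends up implementing:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical switch for disabling inference entirely and keeping the 2.1.0
// behavior (trust the case-insensitive metastore schema). The conf key and its
// values are assumptions for illustration only.
val spark = SparkSession.builder()
  .enableHiveSupport()
  .config("spark.sql.hive.caseSensitiveInferenceMode", "NEVER_INFER") // or e.g. "INFER_AND_SAVE" / "INFER_ONLY"
  .getOrCreate()
```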
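On the second quote, a minimal sketch of the kind of per-table migration script maintainers would be asked to run, which also shows where the execution cost comes from even if the script itself is trivial. Table names, locations, and the table-property key are hypothetical; a real script would enumerate tables from the metastore and handle partitioned or non-Parquet tables:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// Hypothetical tables with case-sensitive Parquet data underneath.
val tables = Seq(
  ("warehouse_db.events", "s3://warehouse/events"),
  ("warehouse_db.clicks", "s3://warehouse/clicks")
)

tables.foreach { case (table, location) =>
  // Re-reading the files is the expensive part: Parquet only needs footers,
  // but a JSON-backed table would require scanning every record.
  val caseSensitiveSchema = spark.read.parquet(location).schema

  // Persist the inferred schema as a table property so later reads can skip inference.
  // The key is illustrative; very large schemas would also need to be split across
  // multiple properties to stay under metastore value-length limits.
  spark.sql(
    s"""ALTER TABLE $table SET TBLPROPERTIES (
       |  'spark.sql.sources.schema.inferred' = '${caseSensitiveSchema.json}'
       |)""".stripMargin)
}
```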
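Finally, to illustrate the "conflicting columns after lower-casing" concern from the last quote: with `spark.sql.caseSensitive=true` it is legal to write data whose column names collide once lower-cased, which is exactly what defeats a purely case-insensitive reader. The path and column names below are made up:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.sql.caseSensitive", "true")
  .getOrCreate()
import spark.implicits._

// Two columns that are distinct only by case; both become "eventtime" after lower-casing.
Seq((1L, 2L)).toDF("eventTime", "eventtime")
  .write.parquet("/tmp/case_conflict")

// A case-insensitive lookup of "eventtime" against this file is ambiguous, which is
// why the case-sensitive schema has to be recorded somewhere (e.g. table properties).
```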