Github user budde commented on the issue: https://github.com/apache/spark/pull/16797

@cloud-fan:

> Spark does support mixed-case-schema tables, and it has always been. It's because we write table schema to metastore case-preserving, via table properties.

Spark prior to 2.1 supported *any* case-sensitive table, regardless of what table properties are set. Spark 2.1 supports these tables if and only if Spark 2.1 was used to create them and embedded the schema as a metadata property.

> So the data file schema must match the table schema, or Spark will fail, it has always been.

This is absolutely not how it has always been. Spark would infer the schema from the source files and use that schema when constructing a logical relation. We've been relying on this behavior for years.

> For any table, as long as hive can read it, Spark can read it.

I've double-checked this, and Hive can query tables backed by case-sensitive Parquet files. Spark 2.1 is currently the only Hive-compatible query engine I'm familiar with that won't support this use case.

> But we supported it, with the cost of runtime schema inference.

My argument is that it should be possible to fall back to this level of support if the properties aren't present.

> This problem was solved in Spark 2.1, by writing table schema to metastore case-preserving for hive serde tables. Now we can say that, the data schema must match the table schema, or Spark should fail.

Spark does not explicitly fail in this case. It falls back to the downcased metastore schema, which will silently fail and return 0 results if a case-sensitive field name is used in your projection or filter predicate (a minimal repro is sketched at the end of this comment).

> That's why I prefer the migration command approach, it keeps the concept clean: data schema must match table schema.

This links Spark upgrades to potentially costly data migrations. From an end-user perspective, prior to 2.1 you could simply point Spark SQL at an external Hive metastore and query any data in it. Now you have to make sure the table has been migrated by the appropriate version of Spark, or your queries may silently return incorrect results.

The migration approach also assumes that Spark has write access to the Hive metastore it is querying. If you have read-only access to a metastore administered by another team or organization, you are at their mercy to run migrations on your behalf against the latest version of Spark before you can query their tables from Spark. I think anybody who has found themselves in a similar situation can attest that it's never good to be beholden to someone else to enable a feature that only matters to you. And again, in some cases migrating all tables in a large Hive warehouse could be an extremely expensive operation that touches petabytes of data.

> Like you said, users can still create a hive table with mixed-case-schema parquet/orc files, by hive or other systems like presto. This table is readable for hive, and for Spark prior to 2.1, because of the runtime schema inference. But this is not intentional, and Spark should not support it as the data file schema and table schema mismatch. We can make the migration command cover this case too.

I will continue to argue strongly against reducing the number of use cases Spark SQL supports out of the box. While a migration command can offer a helpful optimization, I don't think it is acceptable as the only option, for the reasons I've detailed here.
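To make the silent-failure mode concrete, here is a minimal repro sketch (paths and the table name are hypothetical; the behavior described assumes a Spark 2.1 build reading a table that was *not* created by Spark, so no schema is embedded in the table properties):

```scala
// Parquet files written with a mixed-case column name.
spark.range(10).selectExpr("id AS userId").write.parquet("/tmp/mixed_case")

// External Hive table over the same files; the metastore lowercases the
// column name to 'userid'.
spark.sql("""CREATE EXTERNAL TABLE t (userid BIGINT)
             STORED AS PARQUET LOCATION '/tmp/mixed_case'""")

// Prior to 2.1, runtime schema inference reconciled the file schema with the
// metastore schema and this returned 10. On 2.1, Spark scans with the
// lowercased metastore schema, the case-sensitive Parquet reader finds no
// column named 'userid' in the file footer (it only has 'userId'), fills the
// column with nulls, and the predicate filters everything out: the query
// silently returns 0 instead of failing.
spark.sql("SELECT userId FROM t WHERE userId >= 0").count()
```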
Simply put, I think relying on the presence of Spark-specific key/value pairs in the table properties for Spark SQL to function properly, and assuming that Spark (or Spark users) can easily alter those properties to add the table schema, is too brittle for large-scale production use.
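For reference, the Spark-specific properties in question look roughly like the following. The `spark.sql.sources.schema.*` key names come from Spark's `HiveExternalCatalog`; treat the exact values shown as an illustrative assumption:

```scala
// Inspect what Spark 2.1 writes when *it* creates the table: the
// case-preserving schema serialized as JSON, split across numbered parts to
// stay under Hive's limit on property value length.
spark.sql("SHOW TBLPROPERTIES t").show(truncate = false)
// spark.sql.sources.schema.numParts  1
// spark.sql.sources.schema.part.0    {"type":"struct","fields":[{"name":"userId","type":"long", ...}]}
//
// A table created by Hive or Presto carries none of these properties, and on
// a read-only metastore they cannot be added after the fact -- which is the
// brittleness argued above.
```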