Github user mallman commented on the issue:
https://github.com/apache/spark/pull/16797
>> Like you said, users can still create a hive table with
mixed-case-schema parquet/orc files, by hive or other systems like presto. This
table is readable for hive, and for Spark prior to 2.1, because of the runtime
schema inference. But this is not intentional, and Spark should not support it,
as the data file schema and table schema mismatch.
>
> I will continue to argue strongly against reducing the number of use cases
Spark SQL supports out of the box. While offering a migration command can be
a helpful optimization, I don't think it is acceptable as the only option, for
the reasons I've detailed here.
>
> Simply put, I think relying on the presence of Spark-specific key/value
pairs in the table properties in order for Spark SQL to function properly and
assuming that Spark (or Spark users) can easily alter those properties to add
the table schema is too brittle for large-scale production use.
I would have to agree with @budde in this case. In versions of Spark prior
to 2.1, an effort was made to reconcile metastore and file-format case
mismatches using the method `ParquetFileFormat.mergeMetastoreParquetSchema`.
The code docs for that method are here:
https://github.com/apache/spark/blob/1b02f8820ddaf3f2a0e7acc9a7f27afc20683cca/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L711-L719.
I don't see anything there that suggests this was a "hack" or that it was
intended to be removed in a later version. It seems we've simply broken
compatibility with a certain class of Hive tables in Spark 2.1.
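To illustrate the idea behind that reconciliation (a simplified Python sketch, not the actual Scala implementation — the function name and field lists here are invented for illustration): Hive lower-cases column names in the metastore, while Parquet files keep the original mixed case, so the old code matched fields case-insensitively and kept the Parquet casing.

```python
def reconcile_field_names(metastore_fields, parquet_fields):
    """Case-insensitively match metastore field names (lower-cased by Hive)
    against Parquet field names, preferring the Parquet casing.

    Simplified sketch of the idea behind
    ParquetFileFormat.mergeMetastoreParquetSchema; not the real algorithm,
    which also merges types, nullability, and nested structures.
    """
    # Index the Parquet names by their lower-cased form.
    by_lower = {name.lower(): name for name in parquet_fields}
    reconciled = []
    for name in metastore_fields:
        # Use the mixed-case Parquet name when one matches; otherwise keep
        # the metastore name (e.g. a partition column absent from the files).
        reconciled.append(by_lower.get(name.lower(), name))
    return reconciled
```

For example, metastore columns `["userid", "eventtime"]` against Parquet fields `["userId", "eventTime"]` reconcile to the Parquet casing, `["userId", "eventTime"]`.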
Schema inference is very expensive, and doing it at query time on large
tables was painful in versions prior to Spark 2.1 because all of the table's
metadata files were read. But it seems some people were using it nonetheless
and found it useful. At least in Spark 2.1, only the files for the partitions
read in a query would be read for schema inference. That would significantly
improve schema inference performance at query time for partitioned tables.
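To make the cost difference concrete, here is a hypothetical Python sketch (the partition layout, function name, and naive merge are all invented for illustration — Spark's real footer-based merge is stricter about type conflicts):

```python
def infer_schema(partition_files, selected_partitions=None):
    """Merge per-file schemas into a table schema.

    partition_files maps a partition value to a list of per-file schemas
    (each a dict of field name -> type). When selected_partitions is given,
    only those partitions' files are read -- the pruned behavior described
    above; passing None reads every file, as pre-2.1 query-time inference did.
    """
    merged = {}
    for part, file_schemas in partition_files.items():
        if selected_partitions is not None and part not in selected_partitions:
            continue  # skip partitions the query does not touch
        for schema in file_schemas:
            # Naive merge: later files win on name collisions.
            merged.update(schema)
    return merged
```

With a table of N partitions, a query touching one partition only pays for that partition's files instead of all N, which is the improvement described above.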
Incidentally, what happens when a program outside of Spark (such as Hive)
updates the Hive metastore schema of a table with the embedded Spark SQL
schema? Does Spark detect that change and update the embedded schema? Does it
have to redo the schema inference across all files in the table?