Github user mallman commented on the issue:
https://github.com/apache/spark/pull/16797
>> Like you said, users can still create a hive table with
mixed-case-schema parquet/orc files, by hive or other systems like presto. This
table is readable for hive, and for Spark prior to 2.1, because of the runtime
schema inference. But this is not intentional, and Spark should not support it,
as the data file schema and table schema mismatch.
>
> I will continue to argue strongly against reducing the number of use cases
Spark SQL supports out of the box. While offering a migration command can be
a helpful optimization, I don't think it is acceptable as the only option, for
the reasons I've detailed here.
>
> Simply put, I think relying on the presence of Spark-specific key/value
pairs in the table properties in order for Spark SQL to function properly and
assuming that Spark (or Spark users) can easily alter those properties to add
the table schema is too brittle for large-scale production use.
I would have to agree with @budde in this case. In versions of Spark prior
to 2.1, an effort was made to reconcile metastore and file-format case
mismatches using the method `ParquetFileFormat.mergeMetastoreParquetSchema`.
The code docs for that method are here:
https://github.com/apache/spark/blob/1b02f8820ddaf3f2a0e7acc9a7f27afc20683cca/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L711-L719.
I don't see anything there that suggests this was a "hack" or that it was
intended to be removed in a later version. It seems we've simply broken
compatibility with a certain class of Hive tables in Spark 2.1.
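To illustrate the idea behind that reconciliation (a simplified Python sketch, not the actual Scala implementation — the function name and field lists here are invented for illustration): Hive lower-cases column names in the metastore, while Parquet files keep the original mixed case, so the old code matched fields case-insensitively and kept the Parquet casing.

```python
def reconcile_field_names(metastore_fields, parquet_fields):
    """Case-insensitively match metastore field names (lower-cased by Hive)
    against Parquet field names, preferring the Parquet casing.

    Simplified sketch of the idea behind
    ParquetFileFormat.mergeMetastoreParquetSchema; not the real algorithm,
    which also merges types, nullability, and nested structures.
    """
    # Index the Parquet names by their lower-cased form.
    by_lower = {name.lower(): name for name in parquet_fields}
    reconciled = []
    for name in metastore_fields:
        # Use the mixed-case Parquet name when one matches; otherwise keep
        # the metastore name (e.g. a partition column absent from the files).
        reconciled.append(by_lower.get(name.lower(), name))
    return reconciled
```

For example, metastore columns `["userid", "eventtime"]` against Parquet fields `["userId", "eventTime"]` reconcile to the Parquet casing, `["userId", "eventTime"]`.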
Schema inference is very expensive, and doing it at query time on large
tables was painful in versions prior to Spark 2.1 because all of the table's
metadata files were read. But it seems some people were using it nonetheless
and found it useful. At least in Spark 2.1, only the files for the partitions
read in a query would be read for schema inference. That would significantly
improve schema inference performance at query time for partitioned tables.
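To make the cost difference concrete, here is a hypothetical Python sketch (the partition layout, function name, and naive merge are all invented for illustration — Spark's real footer-based merge is stricter about type conflicts):

```python
def infer_schema(partition_files, selected_partitions=None):
    """Merge per-file schemas into a table schema.

    partition_files maps a partition value to a list of per-file schemas
    (each a dict of field name -> type). When selected_partitions is given,
    only those partitions' files are read -- the pruned behavior described
    above; passing None reads every file, as pre-2.1 query-time inference did.
    """
    merged = {}
    for part, file_schemas in partition_files.items():
        if selected_partitions is not None and part not in selected_partitions:
            continue  # skip partitions the query does not touch
        for schema in file_schemas:
            # Naive merge: later files win on name collisions.
            merged.update(schema)
    return merged
```

With a table of N partitions, a query touching one partition only pays for that partition's files instead of all N, which is the improvement described above.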
Incidentally, what happens when a program outside of Spark (such as Hive)
updates the Hive metastore schema of a table with the embedded Spark SQL
schema? Does Spark detect that change and update the embedded schema? Does it
have to redo the schema inference across all files in the table?