Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/16797
  
    @budde Spark does support mixed-case-schema tables, and it always has. This works because we write the table schema to the metastore case-preservingly, via table properties. When we read a table, we get the schema from the metastore and assume it is the schema of the table's data files. So the data file schema must match the table schema, or Spark will fail; that has always been the case.
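
    As a rough sketch of what "case-preserving via table properties" means for a data source table: Spark serializes the full schema as JSON into table properties (the `spark.sql.sources.schema.*` keys written by `HiveExternalCatalog`), while the metastore's own column list stays lowercased. The snippet below is illustrative, not an API contract:

    ```scala
    // Create a data source table with mixed-case column names.
    spark.sql("CREATE TABLE demo (userId INT, eventName STRING) USING parquet")

    // The metastore column list is lowercased, but the JSON schema kept in the
    // table properties (spark.sql.sources.schema.*) preserves "userId" and
    // "eventName", and that is the schema Spark reads back.
    spark.sql("DESCRIBE TABLE EXTENDED demo").show(100, truncate = false)
    ```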
    
    However, there is one exception. There are two kinds of tables in Spark: data source tables and hive serde tables (we have different SQL syntax to create them). Data source tables are fully managed by Spark: we read/write the data files directly and use the hive metastore only as a persistence layer, which means data source tables are not compatible with hive; hive can't read or write them.
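
    For reference, the two kinds correspond to the two CREATE TABLE syntaxes (table and column names below are made up):

    ```scala
    // Data source table: fully managed by Spark; the metastore is only used
    // for persistence.
    spark.sql("CREATE TABLE ds_events (id INT, payLoad STRING) USING parquet")

    // Hive serde table: stored in a hive-compatible layout, readable by hive.
    spark.sql("CREATE TABLE hive_events (id INT, payLoad STRING) STORED AS parquet")
    ```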
    
    Hive serde tables, on the other hand, should be compatible with hive, and we use the hive API to read/write them. For any such table, as long as hive can read it, Spark can read it. The exception is that for the parquet and orc formats we read the data files directly, as an optimization (reading through the hive API is slow). Before Spark 2.1, we saved the schema to the hive metastore directly, which means it was lowercased. Given this, ideally we should not have supported mixed-case-schema parquet/orc data files for this kind of table, or the data schema would mismatch the table schema. But we did support it, at the cost of runtime schema inference.
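
    The direct-read optimization is gated by the `spark.sql.hive.convertMetastoreParquet` and `spark.sql.hive.convertMetastoreOrc` flags; turning them off (a sketch, not a recommendation) falls back to the slower hive reader, which is not affected by the lowercased-schema issue since hive resolves the columns itself:

    ```scala
    // Route parquet/orc hive serde tables through the hive API instead of
    // Spark's native readers: slower, but avoids the fast path that exposes
    // the case mismatch described above.
    spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")
    spark.conf.set("spark.sql.hive.convertMetastoreOrc", "false")
    ```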
    
    This problem was solved in Spark 2.1 by writing the table schema to the metastore case-preservingly for hive serde tables as well. Now we can say that the data schema must match the table schema, or Spark should fail.
    
    Then we come to this problem: for parquet/orc hive serde tables created by Spark prior to 2.1, the data file schema may not match the table schema, but we still need to support them for compatibility.
    
    That's why I prefer the migration command approach; it keeps the concept clean: the data schema must match the table schema.
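
    Such a command doesn't exist yet; the sketch below (table path and name are made up) only illustrates the kind of work it would do for a pre-2.1 parquet hive serde table:

    ```scala
    // Hypothetical migration flow: infer the real, case-preserving schema from
    // the data files, then persist it where Spark 2.1+ keeps it (the table
    // properties in the metastore), so that afterwards the table schema matches
    // the data schema and no runtime inference is needed.
    val inferred = spark.read.parquet("/warehouse/old_table").schema
    println(inferred.treeString)
    // A real command would write `inferred` back into the table's metastore
    // properties at this point.
    ```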
    
    Like you said, users can still create a hive table with mixed-case-schema parquet/orc files, via hive or other systems like presto. Such a table is readable by hive, and by Spark prior to 2.1 thanks to runtime schema inference. But this was never intentional, and Spark should not support it, since the data file schema and the table schema mismatch. We can make the migration command cover this case too.
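
    To make that case concrete (all names and paths are made up): mixed-case parquet files registered through an all-lowercase hive DDL.

    ```scala
    import spark.implicits._

    // Parquet files written with mixed-case column names...
    Seq((1, "click")).toDF("userId", "eventName")
      .write.parquet("/warehouse/ext_events")

    // ...registered in hive/presto with lowercase columns only:
    spark.sql("""
      CREATE EXTERNAL TABLE ext_events (userid INT, eventname STRING)
      STORED AS PARQUET LOCATION '/warehouse/ext_events'
    """)
    // Hive can read this table, but for Spark the table schema and the data
    // file schema now disagree, which is exactly what the migration command
    // would have to cover.
    ```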

