GitHub user budde commented on the issue:

    https://github.com/apache/spark/pull/16797
  
    @cloud-fan:
    
    > Spark does support mixed-case-schema tables, and it has always been. It's 
because we write table schema to metastore case-preserving, via table 
properties.
    
    Spark prior to 2.1 supported *any* case-sensitive table, regardless of what
table properties were set. Spark 2.1 supports these tables if and only if Spark
2.1 was used to create them and embedded the schema as a metadata property.
    
    > So the data file schema must match the table schema, or Spark will fail, 
it has always been.
    
    This is absolutely not how it's always been. Spark would infer the schema
from the source files and use that schema when constructing a logical relation.
We've been relying on this behavior for years.
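
    To make the pre-2.1 behavior concrete, here is a minimal sketch (the path
and column names are hypothetical) of how runtime inference recovers a
case-preserved schema, since Parquet stores field names case-sensitively in
the file footer:

    ```scala
    // Hypothetical data: Parquet files written with mixed-case field names.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("mixed-case-inference-sketch")
      .enableHiveSupport()
      .getOrCreate()

    spark.range(10)
      .selectExpr("id AS userId", "id * 2 AS eventCount")
      .write.parquet("/tmp/mixed_case_events")   // hypothetical path

    // Reading the files directly infers the case-preserved schema, which is
    // exactly what pre-2.1 relation construction relied on:
    spark.read.parquet("/tmp/mixed_case_events").printSchema()
    // root
    //  |-- userId: long (nullable = true)
    //  |-- eventCount: long (nullable = true)
    ```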
    
    > For any table, as long as hive can read it, Spark can read it.
    
    I've double-checked this, and Hive can query tables backed by
case-sensitive Parquet files. Spark 2.1 is currently the only Hive-compatible
query engine I'm familiar with that won't support this use case.
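
    For example, a table of the following shape (sketched via spark.sql; the
DDL and location are hypothetical) is readable from Hive even though the
metastore schema is all lowercase, because Hive matches Parquet columns to
table columns case-insensitively:

    ```scala
    // Hypothetical table over the mixed-case Parquet files from above:
    // the metastore columns are lowercase, the file columns are not.
    spark.sql("""
      CREATE EXTERNAL TABLE events (userid BIGINT, eventcount BIGINT)
      STORED AS PARQUET
      LOCATION '/tmp/mixed_case_events'
    """)
    ```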
    
    > But we supported it, with the cost of runtime schema inference.
    
    My argument is that it should be possible to fall back to this level of
support if the properties aren't present.
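
    In rough terms (a sketch of the idea, not the actual patch in this PR),
the fallback would look like:

    ```scala
    import org.apache.spark.sql.types.StructType

    // Prefer the case-preserving schema stored in table properties; if it
    // isn't there, fall back to inferring one from the data files at
    // runtime, as Spark did prior to 2.1.
    def resolveTableSchema(
        schemaFromProps: Option[StructType], // parsed from spark.sql.sources.schema.*
        inferFromFiles: () => StructType     // runtime inference over the files
    ): StructType =
      schemaFromProps.getOrElse(inferFromFiles())
    ```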
    
    > This problem was solved in Spark 2.1, by writing table schema to 
metastore case-preserving for hive serde tables. Now we can say that, the data 
schema must match the table schema, or Spark should fail.
    
    Spark does not explicitly fail in this case. It falls back to the downcased 
metastore schema, which will silently fail and return 0 results if a 
case-sensitive field name is used in your projection or filter predicate.
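
    Using the hypothetical table from above, the failure mode looks like this:

    ```scala
    // The downcased metastore schema makes the Parquet reader look for a
    // column literally named "userid". The files only contain "userId", so
    // the column resolves to null and the filter drops every row. No error
    // is raised; the query just comes back empty.
    spark.sql("SELECT * FROM events WHERE userid > 5").show()
    // +------+----------+
    // |userid|eventcount|
    // +------+----------+
    // +------+----------+
    ```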
    
    > That's why I prefer the migration command approach, it keeps the concept 
clean: data schema must match table schema.
    
    This links Spark upgrades to potentially costly data migrations. From an
end-user perspective, prior to 2.1 you could simply point Spark SQL to an 
external Hive metastore and query any data in it. Now you have to make sure the 
table has been migrated to the appropriate version of Spark or your queries may 
silently return incorrect results.
    
    The migration approach also assumes that Spark has write access to the Hive
metastore it is querying. If you have read-only access to a metastore
administered by another team or organization, you are at their mercy to run
migrations on your behalf before you can query their tables from the latest
version of Spark. I think anybody who's ever found themselves in a similar
situation can attest that it's never good to be beholden to someone else to
enable a feature that only matters to you.
    
    And again, in some cases migrating all tables in a large Hive warehouse 
could be an extremely expensive operation that potentially touches petabytes of 
data.
    
    > Like you said, users can still create a hive table with mixed-case-schema 
parquet/orc files, by hive or other systems like presto. This table is readable 
for hive, and for Spark prior to 2.1, because of the runtime schema inference.
But this is not intentional, and Spark should not support it as the data file
schema and table schema mismatch. We can make the migration command cover this 
case too.
    
    I will continue to argue strongly against reducing the number of use cases
Spark SQL supports out of the box. While a migration command could be a helpful
optimization, I don't think it is acceptable as the only option, for the
reasons I've detailed here.
    
    Simply put, I think relying on the presence of Spark-specific key/value
pairs in the table properties for Spark SQL to function properly, and assuming
that Spark (or Spark users) can easily alter those properties to add the table
schema, is too brittle for large-scale production use.
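
    For reference, this is the kind of Spark-specific metadata the current
approach depends on. On a table created by Spark 2.1 you'd see properties
along these lines (values abbreviated); a table created by Hive, Presto, or an
older Spark simply won't have them:

    ```scala
    spark.sql("SHOW TBLPROPERTIES events").show(truncate = false)
    // spark.sql.sources.schema.numParts  1
    // spark.sql.sources.schema.part.0    {"type":"struct","fields":[...]}
    ```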

