Github user budde commented on the issue:

    https://github.com/apache/spark/pull/16797
  
    > For better user experience, we should automatically infer the schema and 
write it back to the metastore if there is no case-sensitive table schema in the 
metastore. This has the cost of detecting when schema inference is needed, and it 
complicates the table read code path.
    
    Totally agree. I think the default behavior should be to infer and backfill 
a case-sensitive schema into the table properties if one isn't already there. 
An option should also be provided to disable all inference and just fall back 
to the case-insensitive metastore schema if none is found (i.e. the current 
behavior in 2.1.0).
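    
    For what it's worth, here's a rough sketch of how such a setting could be used. 
The configuration key and mode names are illustrative of the behaviors discussed in 
this thread, not a finalized API:
    
    ```scala
    // Illustrative only: the key and value names below are assumptions sketching
    // the behaviors discussed in this thread, not a finalized Spark API.
    import org.apache.spark.sql.SparkSession
    
    val spark = SparkSession.builder()
      .appName("schema-inference-modes")
      .enableHiveSupport()
      .getOrCreate()
    
    // Proposed default: infer a case-sensitive schema from the data files and write
    // it back to the table properties if one isn't already stored there.
    spark.conf.set("spark.sql.hive.caseSensitiveInferenceMode", "INFER_AND_SAVE")
    
    // Opt-out: skip inference entirely and fall back to the case-insensitive
    // metastore schema, i.e. the 2.1.0 behavior described above.
    spark.conf.set("spark.sql.hive.caseSensitiveInferenceMode", "NEVER_INFER")
    ```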
    
    > If this is only a compatibility issue, I think it's fair to ask the 
cluster maintainers to run some commands after upgrading the Spark cluster. Even 
if there are a lot of tables, it's easy to write a script to automate it.
    
    I don't think this is fair. For one, as I've mentioned, in some cases Spark 
may not be the tool being used to maintain the metastore. This will now require 
the warehouse admins to set up a Spark cluster and run these migration commands 
on every table with case-sensitive underlying data if they'd like them to be 
accessible from Spark. As a second point, while writing an automation script 
may be trivial, the execution costs aren't, especially if the data is stored in 
a format like JSON, where each and every record in the table must be read in 
order to infer the schema.
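    
    To make that cost concrete, here's a minimal sketch (paths are hypothetical) of 
what inference looks like for the two formats; for Parquet the schema can be read 
from file footers, but for JSON there is nothing to read except the records themselves:
    
    ```scala
    // Hypothetical paths; this only illustrates the relative cost of inference.
    import org.apache.spark.sql.SparkSession
    
    val spark = SparkSession.builder().appName("inference-cost").getOrCreate()
    
    // Parquet: the schema comes from file footers/metadata, so this is relatively cheap.
    val parquetSchema = spark.read.parquet("s3://warehouse/events_parquet/").schema
    
    // JSON: with no schema supplied, Spark must scan and parse every record to infer
    // one -- a full read of the table, which is the migration cost argued above.
    val jsonSchema = spark.read.json("s3://warehouse/events_json/").schema
    
    println(parquetSchema.treeString)
    println(jsonSchema.treeString)
    ```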
    
    > If there are no Spark-specific table properties, we assume this table was 
created by Hive (not by external systems like Presto), so the schema of the 
Parquet files should be all lowercase.
    
    This isn't an assumption made by Spark prior to 2.1.0, whether that was an 
explicit decision or not. All I'm asking for is a way to configure Spark to 
continue supporting a use case it has supported for years.
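    
    As a concrete illustration of how these tables come about (table name and data 
are hypothetical): the DataFrame columns are mixed-case, the Parquet files preserve 
that casing, but the Hive metastore only records lowercased column names. Whether a 
case-sensitive copy of the schema also lands in the table properties depends on which 
Spark version created the table, which is exactly the gap at issue here.
    
    ```scala
    // Minimal sketch of how a "case-sensitive Parquet, case-insensitive metastore"
    // table arises. Table name and data are hypothetical.
    import org.apache.spark.sql.SparkSession
    
    val spark = SparkSession.builder()
      .appName("mixed-case-table")
      .enableHiveSupport()
      .getOrCreate()
    
    import spark.implicits._
    
    // Case-sensitive column names end up verbatim in the Parquet footers...
    val df = Seq((1L, "click")).toDF("eventId", "eventType")
    
    // ...while the Hive metastore entry only knows "eventid"/"eventtype".
    df.write.format("parquet").saveAsTable("analytics.events")
    ```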
    
    Also, in our case, the table was created by Spark, not Presto. Presto is 
just an example of another execution engine we've put in front of our warehouse 
that hasn't had a problem with the underlying Parquet data being 
case-sensitive. We just used an older version of Spark to create the tables. I 
would think long and hard about whether requiring warehouse admins to run 
potentially costly migrations between Spark versions to update table metadata 
is preferable to offering a way to remain backwards-compatible with the old 
behavior.
    
    Again, I think introducing a mechanism to migrate the table properties is a 
good idea. I just don't think it should be the only option.
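    
    For reference, a one-off backfill along those lines might look roughly like the 
following. The table, location, and especially the property name are placeholders for 
illustration; the real mechanism would use whatever properties this PR settles on. 
Note that the inference step is the expensive part for JSON-backed tables:
    
    ```scala
    // Hypothetical backfill script: infer the case-sensitive schema from the data
    // files and stash it in a table property so future reads can skip inference.
    // The property name below is a placeholder, not anything Spark defines.
    import org.apache.spark.sql.SparkSession
    
    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
    
    val table = "analytics.events"                  // hypothetical table
    val location = "s3://warehouse/events_parquet/" // its data location
    
    // Expensive step: relatively cheap for Parquet (footers), a full scan for JSON.
    val inferred = spark.read.parquet(location).schema
    
    val schemaJson = inferred.json.replace("'", "\\'")
    spark.sql(
      s"ALTER TABLE $table SET TBLPROPERTIES ('example.case.sensitive.schema' = '$schemaJson')")
    ```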
    
    > Another proposal is to make the Parquet reader case-insensitive, so that we 
can solve this problem without schema inference. But the problem is, Spark can 
be configured to be case-sensitive, so it's possible to write such a schema 
(columns that conflict after lowercasing) into the metastore. I think this 
proposal is the best if we can make Spark totally case-insensitive.
    
    I don't think this would be a bad option if it could be enabled at the 
Parquet level, but it seems that work towards enabling case-insensitive file 
access there has stalled. As @ericl pointed out above, moving this to the 
ParquetReadSupport level may make the situation better for Parquet, but the 
behavior won't be consistent across file formats like ORC or JSON.
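    
    Just to spell out what the case-insensitive matching amounts to, here's a sketch 
of the resolution step. This is only the idea, not Spark's actual ParquetReadSupport 
code, and it still has no good answer when file columns collide after lowercasing:
    
    ```scala
    // Sketch: map each requested (lowercased, metastore) column name onto the
    // physical field name found in a file's schema, case-insensitively.
    import org.apache.spark.sql.types.StructType
    
    def resolveCaseInsensitive(
        requested: StructType,
        fileSchema: StructType): Map[String, String] = {
      val byLowerName = fileSchema.fields.groupBy(_.name.toLowerCase)
      requested.fieldNames.flatMap { wanted =>
        byLowerName.get(wanted.toLowerCase) match {
          case Some(Array(single)) => Some(wanted -> single.name) // unambiguous match
          case Some(_)             => None // columns collide after lowercasing
          case None                => None // column not present in this file
        }
      }.toMap
    }
    ```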

