Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/7520#issuecomment-135392980

One thing to note is that case sensitivity of Spark SQL is configurable ([see here][1]). So I don't think we should make `StructType` completely case insensitive (yet case preserving).

If I understand this issue correctly, the root problem is that our current approach isn't case preserving when writing schema information to physical ORC files. As @chenghao-intel suggested, when saving a DataFrame as a Hive metastore table using ORC, Spark SQL 1.5 now saves it in a Hive-compatible way, so that the data can be read back using Hive. This implies that changes made in this PR should also be compatible with Hive.

After investigating Hive's behavior for a while, I found some interesting things. The snippets below were executed against Hive 1.2.1 (with a PostgreSQL metastore) and Spark SQL 1.5-SNAPSHOT ([revision 0eeee5c][2]).

First, let's prepare a Hive ORC table:

```
hive> CREATE TABLE orc_test STORED AS ORC AS SELECT 1 AS CoL;
...
hive> SELECT col FROM orc_test;
OK
1
Time taken: 0.056 seconds, Fetched: 1 row(s)
hive> SELECT COL FROM orc_test;
OK
1
Time taken: 0.056 seconds, Fetched: 1 row(s)
hive> DESC orc_test;
OK
col                     int
Time taken: 0.047 seconds, Fetched: 1 row(s)
```

So Hive is neither case sensitive nor case preserving. We can further confirm this by checking the metastore table `COLUMNS_V2`:

```
metastore_hive121> SELECT * FROM "COLUMNS_V2"
+---------+-----------+---------------+-------------+---------------+
|   CD_ID | COMMENT   | COLUMN_NAME   | TYPE_NAME   |   INTEGER_IDX |
|---------+-----------+---------------+-------------+---------------|
|      22 | <null>    | col           | int         |             0 |
+---------+-----------+---------------+-------------+---------------+
```

(I cleared my local Hive warehouse, so the only column record here is the one created above.)
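For contrast with Hive's behavior above, here is a minimal sketch showing that `StructType` itself is already case preserving, and that its name-based field lookup is exact-match (the field name `CoL` is just an illustrative example; relaxed, analyzer-level resolution is what the `spark.sql.caseSensitive` conf mentioned above controls):

```scala
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// StructType, unlike Hive's metastore, preserves the original column casing:
val schema = StructType(StructField("CoL", IntegerType) :: Nil)
assert(schema.fieldNames.sameElements(Array("CoL")))

// Name-based lookup on StructType is exact-match; `schema("col")` would
// throw an IllegalArgumentException here.
val field = schema("CoL")
assert(field.dataType == IntegerType)
```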
Now let's read the physical ORC files directly using Spark:

```
scala> sqlContext.read.orc("hdfs://localhost:9000/user/hive/warehouse_hive121/orc_test").printSchema()
root
 |-- _col0: integer (nullable = true)

scala> sqlContext.read.orc("hdfs://localhost:9000/user/hive/warehouse_hive121/orc_test").show()
+-----+
|_col0|
+-----+
|    1|
+-----+
```

Huh? Why is it `_col0` instead of `col`? Let's inspect the physical ORC file written by Hive:

```
$ hive --orcfiledump /user/hive/warehouse_hive121/orc_test/000000_0
Structure for /user/hive/warehouse_hive121/orc_test/000000_0
File Version: 0.12 with HIVE_8732
15/08/27 19:07:15 INFO orc.ReaderImpl: Reading ORC rows from /user/hive/warehouse_hive121/orc_test/000000_0 with {include: null, offset: 0, length: 9223372036854775807}
15/08/27 19:07:15 INFO orc.RecordReaderFactory: Schema is not specified on read. Using file schema.

Rows: 1
Compression: ZLIB
Compression size: 262144
Type: struct<_col0:int>     <---- !!!
...
```

Surprise! So when writing ORC files, *Hive doesn't even preserve the column names*.

Conclusions:

1. Making `StructType` completely case insensitive is unacceptable.
2. The concrete column names Spark SQL writes into ORC files don't affect interoperability with Hive.
3. It would be good for Spark SQL to be case preserving when writing ORC files, and I think that is the task this PR should aim for.

[1]: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala#L247-L249
[2]: https://github.com/apache/spark/commit/bb1640529725c6c38103b95af004f8bd90eeee5c
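A minimal way to exercise the desired case-preserving behavior (a sketch only, assuming a 1.5-era spark-shell with `sc` and a Hive-enabled `sqlContext` in scope; the output path is illustrative):

```scala
// Write a DataFrame with a mixed-case column name to ORC and read it back.
// The goal of this PR is that the schema below prints `CoL`, not a
// lower-cased `col` or a positional `_col0` as Hive produces.
import sqlContext.implicits._

val df = sc.parallelize(Seq(1)).toDF("CoL")
df.write.orc("/tmp/orc_case_preserving_test")
sqlContext.read.orc("/tmp/orc_case_preserving_test").printSchema()
```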