[GitHub] spark pull request: [SPARK-9170][SQL] User-provided columns should...
Github user viirya commented on the pull request: https://github.com/apache/spark/pull/7520#issuecomment-135423114 @liancheng Thanks for the clear investigation and explanation. If I understand it correctly, it means that the original direction of this PR is correct. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-9170][SQL] User-provided columns should...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/7520#issuecomment-135392980

One thing to note is that case sensitivity of Spark SQL is configurable ([see here][1]). So I don't think we should make `StructType` completely case insensitive (yet case preserving).

If I understand this issue correctly, the root problem here is that, while writing schema information to physical ORC files, our current approach isn't case preserving. As suggested by @chenghao-intel, when saving a DataFrame as a Hive metastore table using ORC, Spark SQL 1.5 now saves it in a Hive-compatible way, so that we can read the data back using Hive. This implies that changes made in this PR should also be compatible with Hive.

After investigating Hive's behavior for a while, I got some interesting findings. Snippets below were executed against Hive 1.2.1 (with a PostgreSQL metastore) and Spark SQL 1.5-SNAPSHOT ([revision 05c][2]).

Firstly, let's prepare a Hive ORC table:

```
hive> CREATE TABLE orc_test STORED AS ORC AS SELECT 1 AS CoL;
...
hive> SELECT col FROM orc_test;
OK
1
Time taken: 0.056 seconds, Fetched: 1 row(s)
hive> SELECT COL FROM orc_test;
OK
1
Time taken: 0.056 seconds, Fetched: 1 row(s)
hive> DESC orc_test;
OK
col                     int
Time taken: 0.047 seconds, Fetched: 1 row(s)
```

So Hive is neither case sensitive nor case preserving. We can further confirm this by checking the metastore table `COLUMNS_V2`:

```
metastore_hive121> SELECT * FROM "COLUMNS_V2"
+---------+-----------+---------------+-------------+---------------+
|   CD_ID | COMMENT   | COLUMN_NAME   | TYPE_NAME   |   INTEGER_IDX |
|---------+-----------+---------------+-------------+---------------|
|      22 |           | col           | int         |             0 |
+---------+-----------+---------------+-------------+---------------+
```

(I cleared my local Hive warehouse, so the only column record here is the one created above.)
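To make the three behaviors discussed above concrete, here is a minimal plain-Scala sketch. It is not code from this PR or from Spark: `Field`, `CaseHandling`, `resolve`, and `hiveStyleNormalize` are hypothetical names introduced only for illustration. It contrasts a lookup that preserves case and has configurable sensitivity with the lower-casing that the Hive session above appears to apply when recording column names.

```scala
import java.util.Locale

// Hypothetical stand-in for a schema field; not Spark's StructField.
final case class Field(name: String, dataType: String)

object CaseHandling {
  // Case-preserving lookup: the returned field keeps its original spelling,
  // and whether "col" matches "CoL" depends on the caseSensitive flag.
  def resolve(schema: Seq[Field], requested: String, caseSensitive: Boolean): Option[Field] =
    if (caseSensitive) schema.find(_.name == requested)
    else schema.find(_.name.equalsIgnoreCase(requested))

  // Assumption: mimics the lower-casing observed in the Hive output above,
  // where the original spelling "CoL" is lost.
  def hiveStyleNormalize(schema: Seq[Field]): Seq[Field] =
    schema.map(f => f.copy(name = f.name.toLowerCase(Locale.ROOT)))
}
```

With a schema of `Seq(Field("CoL", "int"))`, `resolve(schema, "col", caseSensitive = false)` finds the field and still reports it as `CoL`, while `caseSensitive = true` finds nothing. After `hiveStyleNormalize`, the only name left is `col`.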
Now let's read the physical ORC files directly using Spark:

```
scala> sqlContext.read.orc("hdfs://localhost:9000/user/hive/warehouse_hive121/orc_test").printSchema()
root
 |-- _col0: integer (nullable = true)

scala> sqlContext.read.orc("hdfs://localhost:9000/user/hive/warehouse_hive121/orc_test").show()
+-----+
|_col0|
+-----+
|    1|
+-----+
```

Huh? Why is it `_col0` instead of `col`? Let's inspect the physical ORC file written by Hive:

```
$ hive --orcfiledump /user/hive/warehouse_hive121/orc_test/00_0
Structure for /user/hive/warehouse_hive121/orc_test/00_0
File Version: 0.12 with HIVE_8732
15/08/27 19:07:15 INFO orc.ReaderImpl: Reading ORC rows from /user/hive/warehouse_hive121/orc_test/00_0 with {include: null, offset: 0, length: 9223372036854775807}
15/08/27 19:07:15 INFO orc.RecordReaderFactory: Schema is not specified on read. Using file schema.
Rows: 1
Compression: ZLIB
Compression size: 262144
Type: struct<_col0:int> < !!!
...
```

Surprise! So, when writing ORC files, *Hive doesn't even preserve the column names*.

Conclusions:

1. Making `StructType` completely case insensitive is unacceptable.
2. Concrete column names written into ORC files by Spark SQL don't affect interoperability with Hive.
3. It would be good for Spark SQL to be case preserving when writing ORC files. And I think this is the task this PR should aim for.

[1]: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala#L247-L249
[2]: https://github.com/apache/spark/commit/bb1640529725c6c38103b95af004f8bd905c
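Since the file written by Hive carries only the placeholder `_col0` while the metastore records `col`, a reader that wants the logical name has to get it from somewhere other than the file. The plain-Scala sketch below shows one way this could be done, assuming that placeholders of the form `_colN` refer to the N-th metastore column by position. That positional assumption, and the names `PositionalNames` and `restoreNames`, are illustrative only and are not taken from this PR, Spark, or Hive.

```scala
object PositionalNames {
  // Matches Hive-style placeholder names such as "_col0"; the digit count is
  // capped so that toInt below cannot overflow.
  private val Placeholder = "_col(\\d{1,9})".r

  // Replaces each placeholder with the metastore name at the same index.
  // Names that are not placeholders, or whose index is out of range,
  // are returned unchanged.
  def restoreNames(physical: Seq[String], metastore: Seq[String]): Seq[String] =
    physical.map {
      case Placeholder(idx) if idx.toInt < metastore.length => metastore(idx.toInt)
      case other => other
    }
}
```

For the table above, `PositionalNames.restoreNames(Seq("_col0"), Seq("col"))` yields `Seq("col")`. Files whose physical names are already concrete, such as those written by Spark SQL, pass through untouched.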
[GitHub] spark pull request: [SPARK-9170][SQL] User-provided columns should...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7520#issuecomment-135374076 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41677/ Test PASSed.
[GitHub] spark pull request: [SPARK-9170][SQL] User-provided columns should...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7520#issuecomment-135374072 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-9170][SQL] User-provided columns should...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7520#issuecomment-135373958 [Test build #41677 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41677/console) for PR 7520 at commit [`a389746`](https://github.com/apache/spark/commit/a38974647ac75a359ae7495af39b93152a437d72). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-9170][SQL] User-provided columns should...
Github user viirya commented on the pull request: https://github.com/apache/spark/pull/7520#issuecomment-135334139 OK, thanks. I'll wait for @liancheng's review.
[GitHub] spark pull request: [SPARK-9170][SQL] User-provided columns should...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7520#issuecomment-135333184 [Test build #41677 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41677/consoleFull) for PR 7520 at commit [`a389746`](https://github.com/apache/spark/commit/a38974647ac75a359ae7495af39b93152a437d72).
[GitHub] spark pull request: [SPARK-9170][SQL] User-provided columns should...
Github user zhzhan commented on the pull request: https://github.com/apache/spark/pull/7520#issuecomment-135332239 @liancheng has more insight into this part.
[GitHub] spark pull request: [SPARK-9170][SQL] User-provided columns should...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7520#issuecomment-135329490 Merged build triggered.
[GitHub] spark pull request: [SPARK-9170][SQL] User-provided columns should...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7520#issuecomment-135329544 Merged build started.
[GitHub] spark pull request: [SPARK-9170][SQL] User-provided columns should...
Github user viirya commented on the pull request: https://github.com/apache/spark/pull/7520#issuecomment-135329004 @zhzhan @chenghao-intel Thanks for the comments. I've updated the PR title and code. Please check whether it looks OK to you.