[GitHub] spark pull request: [SPARK-9170][SQL] User-provided columns should...

2015-08-27 Thread viirya
Github user viirya commented on the pull request:

https://github.com/apache/spark/pull/7520#issuecomment-135423114
  
@liancheng Thanks for the clear investigation and explanation.

If I understand correctly, this means that the original direction of this 
PR is correct.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-9170][SQL] User-provided columns should...

2015-08-27 Thread liancheng
Github user liancheng commented on the pull request:

https://github.com/apache/spark/pull/7520#issuecomment-135392980
  
Note that the case sensitivity of Spark SQL is configurable 
([see here] [1]), so I don't think we should make `StructType` completely case 
insensitive (yet case preserving).
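
To make the distinction concrete, here is a minimal sketch of name resolution that is optionally case insensitive but always case preserving. It is plain Scala, not Spark's implementation: `Field`, `CaseResolution`, and `resolve` are hypothetical names, and the `caseSensitive` flag only stands in for the configurable setting mentioned above.

```scala
object CaseResolution {
  // Minimal stand-in for a schema field; this is NOT Spark's StructField.
  final case class Field(name: String, dataType: String)

  // Hypothetical resolver: looks up a user-supplied column name in a schema.
  // `caseSensitive` stands in for the configurable Spark SQL setting.
  def resolve(schema: Seq[Field], column: String, caseSensitive: Boolean): Option[Field] = {
    def sameName(a: String, b: String): Boolean =
      if (caseSensitive) a == b else a.equalsIgnoreCase(b)
    // The returned Field keeps its name exactly as declared (case preserving).
    schema.find(f => sameName(f.name, column))
  }
}
```

With a schema containing `Field("CoL", "int")`, a lookup for `col` succeeds only when `caseSensitive` is `false`, and in that case the resolved field still reports its name as `CoL`.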

If I understand this issue correctly, the root problem is that our current 
approach to writing schema information into physical ORC files isn't case 
preserving.  As suggested by @chenghao-intel, when saving a DataFrame as a 
Hive metastore table using ORC, Spark SQL 1.5 now saves it in a 
Hive-compatible way, so that the data can be read back using Hive.  This 
implies that changes made in this PR should also be compatible with Hive.  
After investigating Hive's behavior for a while, I found some interesting 
results.

Snippets below were executed against Hive 1.2.1 (with a PostgreSQL 
metastore) and Spark SQL 1.5-SNAPSHOT ([revision 05c] [2]).  Firstly, let's 
prepare a Hive ORC table:

```
hive> CREATE TABLE orc_test STORED AS ORC AS SELECT 1 AS CoL;
...
hive> SELECT col FROM orc_test;
OK
1
Time taken: 0.056 seconds, Fetched: 1 row(s)
hive> SELECT COL FROM orc_test;
OK
1
Time taken: 0.056 seconds, Fetched: 1 row(s)
hive> DESC orc_test;
OK
col int
Time taken: 0.047 seconds, Fetched: 1 row(s)
```

So Hive is neither case sensitive nor case preserving.  We can further 
confirm this by checking the metastore table `COLUMNS_V2`:

```
metastore_hive121> SELECT * FROM "COLUMNS_V2"
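+---------+-----------+---------------+-------------+---------------+
|   CD_ID |   COMMENT | COLUMN_NAME   | TYPE_NAME   |   INTEGER_IDX |
|---------+-----------+---------------+-------------+---------------|
|      22 |           | col           | int         |             0 |
+---------+-----------+---------------+-------------+---------------+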
```

(I cleared my local Hive warehouse, so the only column record here is the 
one created above.)

Now let's read the physical ORC files directly using Spark:

```
scala> sqlContext.read.orc("hdfs://localhost:9000/user/hive/warehouse_hive121/orc_test").printSchema()
root
 |-- _col0: integer (nullable = true)

scala> sqlContext.read.orc("hdfs://localhost:9000/user/hive/warehouse_hive121/orc_test").show()
+-----+
|_col0|
+-----+
|    1|
+-----+
```

Huh? Why is it `_col0` instead of `col`?  Let's inspect the physical ORC 
file written by Hive:

```
$ hive --orcfiledump /user/hive/warehouse_hive121/orc_test/00_0

Structure for /user/hive/warehouse_hive121/orc_test/00_0
File Version: 0.12 with HIVE_8732
15/08/27 19:07:15 INFO orc.ReaderImpl: Reading ORC rows from /user/hive/warehouse_hive121/orc_test/00_0 with {include: null, offset: 0, length: 9223372036854775807}
15/08/27 19:07:15 INFO orc.RecordReaderFactory: Schema is not specified on read. Using file schema.
Rows: 1
Compression: ZLIB
Compression size: 262144
Type: struct<_col0:int>   <-- !!!
...
```

Surprise!  So, when writing ORC files, *Hive doesn't even preserve the 
column names*.
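
Since the file only carries positional names, a reader that wants the metastore names has to map them back itself. Below is a minimal sketch of that idea in plain Scala. It is not how Hive or Spark SQL implement it: `OrcColumnNames` and `restoreNames` are hypothetical, and the sketch assumes that the number in `_colN` corresponds to the column's ordinal in the metastore (the `INTEGER_IDX` value shown above).

```scala
object OrcColumnNames {
  // Positional names as seen in the Hive-written file above: _col0, _col1, ...
  private val Positional = "_col(\\d{1,9})".r

  // Hypothetical helper: replaces positional ORC field names with the
  // metastore column name at the same ordinal. Any other name is kept
  // untouched, so files that already carry real names are unaffected.
  def restoreNames(fileColumns: Seq[String], metastoreColumns: Seq[String]): Seq[String] =
    fileColumns.map {
      case Positional(idx) if idx.toInt < metastoreColumns.length =>
        metastoreColumns(idx.toInt)
      case other => other
    }
}
```

For the table above, `OrcColumnNames.restoreNames(Seq("_col0"), Seq("col"))` gives `Seq("col")`, while a file whose field is already named `CoL` keeps that name.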

Conclusions:

1.  Making `StructType` completely case insensitive is unacceptable.
1.  Concrete column names written into ORC files by Spark SQL don't affect 
interoperability with Hive.
1.  It would be good for Spark SQL to be case preserving when writing ORC 
files.

And I think this is what this PR should aim for.

[1]: 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala#L247-L249
[2]: 
https://github.com/apache/spark/commit/bb1640529725c6c38103b95af004f8bd905c





[GitHub] spark pull request: [SPARK-9170][SQL] User-provided columns should...

2015-08-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7520#issuecomment-135374076
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41677/





[GitHub] spark pull request: [SPARK-9170][SQL] User-provided columns should...

2015-08-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7520#issuecomment-135374072
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-9170][SQL] User-provided columns should...

2015-08-27 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7520#issuecomment-135373958
  
  [Test build #41677 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41677/console) for PR 7520 at commit [`a389746`](https://github.com/apache/spark/commit/a38974647ac75a359ae7495af39b93152a437d72).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-9170][SQL] User-provided columns should...

2015-08-27 Thread viirya
Github user viirya commented on the pull request:

https://github.com/apache/spark/pull/7520#issuecomment-135334139
  
OK, thanks. Let's wait for @liancheng's review.





[GitHub] spark pull request: [SPARK-9170][SQL] User-provided columns should...

2015-08-27 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7520#issuecomment-135333184
  
  [Test build #41677 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41677/consoleFull) for PR 7520 at commit [`a389746`](https://github.com/apache/spark/commit/a38974647ac75a359ae7495af39b93152a437d72).





[GitHub] spark pull request: [SPARK-9170][SQL] User-provided columns should...

2015-08-27 Thread zhzhan
Github user zhzhan commented on the pull request:

https://github.com/apache/spark/pull/7520#issuecomment-135332239
  
@liancheng has more insight into this part.





[GitHub] spark pull request: [SPARK-9170][SQL] User-provided columns should...

2015-08-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7520#issuecomment-135329490
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-9170][SQL] User-provided columns should...

2015-08-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7520#issuecomment-135329544
  
Merged build started.





[GitHub] spark pull request: [SPARK-9170][SQL] User-provided columns should...

2015-08-27 Thread viirya
Github user viirya commented on the pull request:

https://github.com/apache/spark/pull/7520#issuecomment-135329004
  
@zhzhan @chenghao-intel Thanks for the comments. I've updated the PR title and 
code. Please check whether it looks OK to you.

