[GitHub] spark issue #19571: [SPARK-15474][SQL] Write and read back non-emtpy schema ...

2017-12-03 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/19571
  
This is resolved in https://github.com/apache/spark/pull/19651 .


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19571: [SPARK-15474][SQL] Write and read back non-emtpy schema ...

2017-10-29 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/19571
  
yes please


---




[GitHub] spark issue #19571: [SPARK-15474][SQL] Write and read back non-emtpy schema ...

2017-10-29 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/19571
  
I see. Then, can we continue on #17980 ?


---




[GitHub] spark issue #19571: [SPARK-15474][SQL] Write and read back non-emtpy schema ...

2017-10-28 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/19571
  
Sorry, I misunderstood the problem at the beginning. I thought the new ORC 
version just changed the existing APIs a little, but it turns out the new ORC 
version has a new set of read/write APIs.


---




[GitHub] spark issue #19571: [SPARK-15474][SQL] Write and read back non-emtpy schema ...

2017-10-26 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/19571
  
To be clear, for [ORC File 
Versions](https://github.com/apache/orc/blob/master/proto/orc_proto.proto#L240-L248),
 there are some ORC test cases against version 0.11, but that is out of our scope 
because Spark (and Hive 1.2) uses `0.12 with HIVE_8732`.

With 0.12, there are 6 writer versions after the original:

- 0 = original
- **1 = HIVE-8732 fixed (fixed stripe/file maximum statistics & string 
statistics use utf8 for min/max)**
- 2 = HIVE-4243 fixed (use real column names from Hive tables)
- 3 = HIVE-12055 fixed (vectorized writer implementation)
- 4 = HIVE-13083 fixed (decimals write present stream correctly)
- 5 = ORC-101 fixed (bloom filters use utf8 consistently)
- 6 = ORC-135 fixed (timestamp statistics use utc)
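
The writer-version list above can be pictured as a small lookup table. The following is a minimal standalone sketch in Java, not the ORC library's own class: the enum name `OrcWriterVersion`, the `fromId` helper, and the assumption that a higher id includes every earlier fix are all introduced here for illustration only.

```java
import java.util.Optional;

// Illustrative only: mirrors the writer-version list quoted in the comment above.
// This is a hypothetical standalone enum, not a class from the ORC library.
enum OrcWriterVersion {
    ORIGINAL(0, "original"),
    HIVE_8732(1, "stripe/file maximum statistics fixed; string statistics use utf8 for min/max"),
    HIVE_4243(2, "use real column names from Hive tables"),
    HIVE_12055(3, "vectorized writer implementation"),
    HIVE_13083(4, "decimals write present stream correctly"),
    ORC_101(5, "bloom filters use utf8 consistently"),
    ORC_135(6, "timestamp statistics use utc");

    private final int id;
    private final String description;

    OrcWriterVersion(int id, String description) {
        this.id = id;
        this.description = description;
    }

    int id() {
        return id;
    }

    String description() {
        return description;
    }

    // Look up a version by the numeric id stored in a file footer.
    // Unknown ids yield an empty Optional instead of an exception.
    static Optional<OrcWriterVersion> fromId(int id) {
        for (OrcWriterVersion v : values()) {
            if (v.id == id) {
                return Optional.of(v);
            }
        }
        return Optional.empty();
    }

    // Assumption: fixes are cumulative, so a writer at a higher id
    // also contains every fix with a lower id.
    boolean includes(OrcWriterVersion fix) {
        return id >= fix.id;
    }
}

class WriterVersionDemo {
    public static void main(String[] args) {
        // The version the comment says Spark (and Hive 1.2) uses.
        OrcWriterVersion sparkWriter = OrcWriterVersion.HIVE_8732;
        for (OrcWriterVersion v : OrcWriterVersion.values()) {
            System.out.println(v.id() + " = " + v + " included by Spark's writer: "
                    + sparkWriter.includes(v));
        }
    }
}
```

Under that cumulative assumption, a compatibility suite would need sample files for each id a reader may encounter, since files written at ids 2 through 6 contain fixes that the `HIVE_8732` writer does not.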


---




[GitHub] spark issue #19571: [SPARK-15474][SQL] Write and read back non-emtpy schema ...

2017-10-26 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/19571
  
@gatorsmile and @cloud-fan . 
For ORC compatibility, I checked the ORC code, but compatibility is not clearly tested there.
I'll try to add a test suite as a separate issue.


---




[GitHub] spark issue #19571: [SPARK-15474][SQL] Write and read back non-emtpy schema ...

2017-10-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19571
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83072/
Test PASSed.


---




[GitHub] spark issue #19571: [SPARK-15474][SQL] Write and read back non-emtpy schema ...

2017-10-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19571
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19571: [SPARK-15474][SQL] Write and read back non-emtpy schema ...

2017-10-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19571
  
**[Test build #83072 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83072/testReport)**
 for PR 19571 at commit 
[`8d212f0`](https://github.com/apache/spark/commit/8d212f049ccd176e5d6800d620929eed20844415).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19571: [SPARK-15474][SQL] Write and read back non-emtpy schema ...

2017-10-25 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/19571
  
Thank you for the review, @gatorsmile and @cloud-fan . In particular, @cloud-fan 
's opinion matches my original approach in #17980 and #18953 (before Aug 16). I 
couldn't agree more.

> Basically we leave the old orc data source as it is, and implement a new 
orc 1.4.1 data source in sql core module. Then we have an internal config to 
switch the implementation(by default prefer the new implementation), and remove 
the old implementation after one or two releases.

BTW, I'm wondering what has changed since you commented [the 
following](https://github.com/apache/spark/pull/18953#issuecomment-322827590) 
on that PR on 16th Aug.

> Are the ORC APIs changed a lot in 1.4? I was expecting a small patch to 
upgrade the current ORC data source, without moving it to sql/core.


---




[GitHub] spark issue #19571: [SPARK-15474][SQL] Write and read back non-emtpy schema ...

2017-10-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19571
  
**[Test build #83072 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83072/testReport)**
 for PR 19571 at commit 
[`8d212f0`](https://github.com/apache/spark/commit/8d212f049ccd176e5d6800d620929eed20844415).


---




[GitHub] spark issue #19571: [SPARK-15474][SQL] Write and read back non-emtpy schema ...

2017-10-25 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/19571
  
> What is the backward compatibility of ORC 1.4.1? Can we create multiple 
ORC files with the previous versions and ensure that reading them is not broken?

That's a good point, and I think it's better to have these tests in the ORC 
project. If they don't have them, then we can take over and add these tests.


---




[GitHub] spark issue #19571: [SPARK-15474][SQL] Write and read back non-emtpy schema ...

2017-10-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19571
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83055/
Test FAILed.


---




[GitHub] spark issue #19571: [SPARK-15474][SQL] Write and read back non-emtpy schema ...

2017-10-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19571
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #19571: [SPARK-15474][SQL] Write and read back non-emtpy schema ...

2017-10-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19571
  
**[Test build #83055 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83055/testReport)**
 for PR 19571 at commit 
[`8b4fc96`](https://github.com/apache/spark/commit/8b4fc96a2f6ed56403a68a6dc401d54380c29fa9).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19571: [SPARK-15474][SQL] Write and read back non-emtpy schema ...

2017-10-25 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/19571
  
I checked how we introduced the new Parquet reader before, and maybe we 
can follow it: https://github.com/apache/spark/pull/4308

Basically we leave the old ORC data source as it is, and implement a new 
ORC 1.4.1 data source in the sql core module. Then we have an internal config to 
switch the implementation (true by default), and remove the old implementation 
after one or two releases.
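
The switch described above could look roughly like the following sketch. It is written in plain Java with no Spark dependency, and every name in it is hypothetical: the `OrcSource` interface, both implementation classes, and the config key `example.sql.orc.useNewImplementation` are placeholders, not Spark's actual classes or configuration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a data source implementation.
interface OrcSource {
    String name();
}

// Placeholder for the existing data source, left as it is.
class LegacyHiveOrcSource implements OrcSource {
    public String name() {
        return "old ORC data source";
    }
}

// Placeholder for the new data source built on ORC 1.4.1.
class NativeOrcSource implements OrcSource {
    public String name() {
        return "new ORC 1.4.1 data source";
    }
}

class OrcSourceSelector {
    // Hypothetical internal flag; the real key name is not given in this thread.
    static final String USE_NEW_KEY = "example.sql.orc.useNewImplementation";

    // Picks the implementation from the config, defaulting to the new one
    // when the flag is absent ("by default true").
    static OrcSource select(Map<String, String> conf) {
        boolean useNew = Boolean.parseBoolean(conf.getOrDefault(USE_NEW_KEY, "true"));
        return useNew ? new NativeOrcSource() : new LegacyHiveOrcSource();
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        System.out.println("default: " + select(conf).name());

        // Users who hit a regression can fall back without a code change.
        conf.put(USE_NEW_KEY, "false");
        System.out.println("fallback: " + select(conf).name());
    }
}
```

Keeping both implementations behind one flag means the old code path can be deleted later by removing a single branch, which fits the "remove after one or two releases" plan.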


---




[GitHub] spark issue #19571: [SPARK-15474][SQL] Write and read back non-emtpy schema ...

2017-10-25 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/19571
  
What is the backward compatibility of ORC 1.4.1? Can we create multiple 
ORC files with the previous versions and ensure that reading them is not broken?


---




[GitHub] spark issue #19571: [SPARK-15474][SQL] Write and read back non-emtpy schema ...

2017-10-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19571
  
**[Test build #83055 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83055/testReport)**
 for PR 19571 at commit 
[`8b4fc96`](https://github.com/apache/spark/commit/8b4fc96a2f6ed56403a68a6dc401d54380c29fa9).


---




[GitHub] spark issue #19571: [SPARK-15474][SQL] Write and read back non-emtpy schema ...

2017-10-25 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/19571
  
I updated the PR.
Could you review this PR again, @viirya , @HyukjinKwon , @gatorsmile , 
@cloud-fan ?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19571: [SPARK-15474][SQL] Write and read back non-emtpy schema ...

2017-10-25 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/19571
  
Thank you for the review, @viirya and @HyukjinKwon .
As you know, I tried to introduce a new OrcFileFormat in `sql/core` before, 
but it was too big to review. Following @cloud-fan 's advice, I'm trying to 
upgrade the existing one piece by piece.

So far,
- We introduced the new ORC 1.4.1 dependency.
- We introduced new Spark SQL ORC parameters and replaced the Hive constants with the new 
ORC parameters.

This is the first PR that actually reads and writes using the ORC 1.4.1 library.
- It reads ORC files only to infer the schema.
- It writes only empty ORC files.


---




[GitHub] spark issue #19571: [SPARK-15474][SQL] Write and read back non-emtpy schema ...

2017-10-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19571
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19571: [SPARK-15474][SQL] Write and read back non-emtpy schema ...

2017-10-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19571
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83030/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19571: [SPARK-15474][SQL] Write and read back non-emtpy schema ...

2017-10-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19571
  
**[Test build #83030 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83030/testReport)**
 for PR 19571 at commit 
[`be7ba9b`](https://github.com/apache/spark/commit/be7ba9b5a9c70519a7fa1b0497955fbba763e2e6).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19571: [SPARK-15474][SQL] Write and read back non-emtpy schema ...

2017-10-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19571
  
**[Test build #83030 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83030/testReport)**
 for PR 19571 at commit 
[`be7ba9b`](https://github.com/apache/spark/commit/be7ba9b5a9c70519a7fa1b0497955fbba763e2e6).


---
