[GitHub] spark issue #19571: [SPARK-15474][SQL] Write and read back non-emtpy schema ...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/19571 This is resolved in https://github.com/apache/spark/pull/19651 . --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19571: [SPARK-15474][SQL] Write and read back non-emtpy schema ...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19571 yes please
[GitHub] spark issue #19571: [SPARK-15474][SQL] Write and read back non-emtpy schema ...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/19571 I see. Then, can we continue on #17980 ?
[GitHub] spark issue #19571: [SPARK-15474][SQL] Write and read back non-emtpy schema ...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19571 Sorry, I misunderstood the problem at the beginning. I thought the new ORC version just changed the existing APIs a little, but it turns out the new ORC version has a new set of read/write APIs.
[GitHub] spark issue #19571: [SPARK-15474][SQL] Write and read back non-emtpy schema ...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/19571

To be clear, for [ORC File Versions](https://github.com/apache/orc/blob/master/proto/orc_proto.proto#L240-L248), there exist some ORC test cases against version 0.11, but that is outside our scope because Spark (and Hive 1.2) uses `0.12 with HIVE_8732`. Within 0.12, the original writer version is followed by 6 fix versions:

- 0 = original
- **1 = HIVE-8732 fixed (fixed stripe/file maximum statistics & string statistics use utf8 for min/max)**
- 2 = HIVE-4243 fixed (use real column names from Hive tables)
- 3 = HIVE-12055 fixed (vectorized writer implementation)
- 4 = HIVE-13083 fixed (decimals write present stream correctly)
- 5 = ORC-101 fixed (bloom filters use utf8 consistently)
- 6 = ORC-135 fixed (timestamp statistics use utc)
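To make the version list above easier to reason about, here is a minimal Python sketch that transcribes it into an enum. The class name, member names, and helper functions are hypothetical and are not part of ORC or Spark. It also assumes the fixes are cumulative, that is, a file written with a later writer version contains every earlier fix; the comment above does not state this explicitly.

```python
from enum import IntEnum
from typing import List


class WriterVersion(IntEnum):
    """ORC 0.12 writer versions, transcribed from the list above.

    The member names are hypothetical labels chosen for this sketch.
    """
    ORIGINAL = 0
    HIVE_8732 = 1    # stripe/file maximum statistics, utf8 string min/max
    HIVE_4243 = 2    # real column names from Hive tables
    HIVE_12055 = 3   # vectorized writer implementation
    HIVE_13083 = 4   # decimals write present stream correctly
    ORC_101 = 5      # bloom filters use utf8 consistently
    ORC_135 = 6      # timestamp statistics use utc


def includes_fix(file_version: WriterVersion, fix: WriterVersion) -> bool:
    # Assumption: fixes are cumulative, so a plain comparison is enough.
    return file_version >= fix


def missing_fixes(file_version: WriterVersion) -> List[WriterVersion]:
    # Every fix introduced after the version that wrote the file.
    return [v for v in WriterVersion if v > file_version]


if __name__ == "__main__":
    # Spark (and Hive 1.2) is described above as writing "0.12 with HIVE_8732".
    current = WriterVersion.HIVE_8732
    for fix in missing_fixes(current):
        print(f"not included: {fix.name}")
```

Under these assumptions, a file written at version 1 lacks the five later fixes, which is the kind of gap a compatibility test suite would need to cover.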
[GitHub] spark issue #19571: [SPARK-15474][SQL] Write and read back non-emtpy schema ...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/19571 @gatorsmile and @cloud-fan . For ORC compatibility, I checked the ORC code, but compatibility is not clearly tested there. I'll try to add a test suite as a separate issue.
[GitHub] spark issue #19571: [SPARK-15474][SQL] Write and read back non-emtpy schema ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19571 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83072/ Test PASSed.
[GitHub] spark issue #19571: [SPARK-15474][SQL] Write and read back non-emtpy schema ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19571 Merged build finished. Test PASSed.
[GitHub] spark issue #19571: [SPARK-15474][SQL] Write and read back non-emtpy schema ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19571

**[Test build #83072 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83072/testReport)** for PR 19571 at commit [`8d212f0`](https://github.com/apache/spark/commit/8d212f049ccd176e5d6800d620929eed20844415).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #19571: [SPARK-15474][SQL] Write and read back non-emtpy schema ...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/19571

Thank you for the review, @gatorsmile and @cloud-fan . In particular, @cloud-fan 's suggestion is my original approach in #17980 and #18953 (before Aug 16). I couldn't agree more.

> Basically we leave the old orc data source as it is, and implement a new orc 1.4.1 data source in sql core module. Then we have an internal config to switch the implementation(by default prefer the new implementation), and remove the old implementation after one or two releases.

BTW, I'm wondering what has changed since you commented [the following](https://github.com/apache/spark/pull/18953#issuecomment-322827590) on that PR on 16th Aug.

> Are the ORC APIs changed a lot in 1.4? I was expecting a small patch to upgrade the current ORC data source, without moving it to sql/core.
[GitHub] spark issue #19571: [SPARK-15474][SQL] Write and read back non-emtpy schema ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19571 **[Test build #83072 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83072/testReport)** for PR 19571 at commit [`8d212f0`](https://github.com/apache/spark/commit/8d212f049ccd176e5d6800d620929eed20844415).
[GitHub] spark issue #19571: [SPARK-15474][SQL] Write and read back non-emtpy schema ...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19571

> What is the backward compatibility of ORC 1.4.1? Can we create multiple ORC files created by the previous versions and ensure they are not broken?

That's a good point, and I think it's better to have these tests in the ORC project. If they don't have them, then we can take over and add these tests.
[GitHub] spark issue #19571: [SPARK-15474][SQL] Write and read back non-emtpy schema ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19571 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83055/ Test FAILed.
[GitHub] spark issue #19571: [SPARK-15474][SQL] Write and read back non-emtpy schema ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19571 Merged build finished. Test FAILed.
[GitHub] spark issue #19571: [SPARK-15474][SQL] Write and read back non-emtpy schema ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19571

**[Test build #83055 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83055/testReport)** for PR 19571 at commit [`8b4fc96`](https://github.com/apache/spark/commit/8b4fc96a2f6ed56403a68a6dc401d54380c29fa9).

* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #19571: [SPARK-15474][SQL] Write and read back non-emtpy schema ...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19571 I checked how we introduced the new Parquet reader before, and maybe we can follow it: https://github.com/apache/spark/pull/4308 Basically we leave the old ORC data source as it is, and implement a new ORC 1.4.1 data source in the sql core module. Then we have an internal config to switch the implementation (by default true), and remove the old implementation after one or two releases.
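The switch described above can be sketched roughly as follows. This is a plain-Python illustration of the pattern only, not Spark code: the config key, the two reader functions, and their return values are all hypothetical, since the thread does not name the actual setting. The one detail taken from the comment is that the flag defaults to true, so the new implementation is preferred unless it is explicitly turned off.

```python
from typing import Callable, Mapping

# Hypothetical config key; the thread does not name the real setting.
USE_NEW_ORC_KEY = "example.orc.useNewImplementation"


def read_with_old_orc(path: str) -> str:
    # Stand-in for the existing ORC data source, left untouched.
    return f"old-orc:{path}"


def read_with_new_orc(path: str) -> str:
    # Stand-in for the new ORC 1.4.1 data source.
    return f"new-orc:{path}"


def select_orc_reader(conf: Mapping[str, str]) -> Callable[[str], str]:
    # The flag defaults to "true", so the new implementation is chosen
    # unless the user explicitly switches back to the old one.
    use_new = conf.get(USE_NEW_ORC_KEY, "true").strip().lower() == "true"
    return read_with_new_orc if use_new else read_with_old_orc


if __name__ == "__main__":
    print(select_orc_reader({})("/tmp/a.orc"))
    print(select_orc_reader({USE_NEW_ORC_KEY: "false"})("/tmp/a.orc"))
```

Keeping both code paths behind one flag means the old implementation can later be removed by deleting one function and the flag lookup, which matches the "remove after one or two releases" plan.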
[GitHub] spark issue #19571: [SPARK-15474][SQL] Write and read back non-emtpy schema ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19571 What is the backward compatibility of ORC 1.4.1? Can we create multiple ORC files created by the previous versions and ensure they are not broken?
[GitHub] spark issue #19571: [SPARK-15474][SQL] Write and read back non-emtpy schema ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19571 **[Test build #83055 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83055/testReport)** for PR 19571 at commit [`8b4fc96`](https://github.com/apache/spark/commit/8b4fc96a2f6ed56403a68a6dc401d54380c29fa9).
[GitHub] spark issue #19571: [SPARK-15474][SQL] Write and read back non-emtpy schema ...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/19571 I updated the PR. Could you review this PR again, @viirya , @HyukjinKwon , @gatorsmile , @cloud-fan ?
[GitHub] spark issue #19571: [SPARK-15474][SQL] Write and read back non-emtpy schema ...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/19571

Thank you for the review, @viirya and @HyukjinKwon . You know that I tried to introduce a new OrcFileFormat in `sql/core` before, but it was too big to review. Following @cloud-fan 's advice, I'm trying to upgrade the existing one piece by piece. So far,

- We introduced the new ORC 1.4.1 dependency.
- We introduced new Spark SQL ORC parameters and replaced Hive constants with the new ORC parameters.

This is the first PR that actually reads and writes using the ORC 1.4.1 library.

- It reads ORC files only for inferring the schema.
- It writes only empty ORC files.
[GitHub] spark issue #19571: [SPARK-15474][SQL] Write and read back non-emtpy schema ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19571 Merged build finished. Test PASSed.
[GitHub] spark issue #19571: [SPARK-15474][SQL] Write and read back non-emtpy schema ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19571 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83030/ Test PASSed.
[GitHub] spark issue #19571: [SPARK-15474][SQL] Write and read back non-emtpy schema ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19571

**[Test build #83030 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83030/testReport)** for PR 19571 at commit [`be7ba9b`](https://github.com/apache/spark/commit/be7ba9b5a9c70519a7fa1b0497955fbba763e2e6).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #19571: [SPARK-15474][SQL] Write and read back non-emtpy schema ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19571 **[Test build #83030 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83030/testReport)** for PR 19571 at commit [`be7ba9b`](https://github.com/apache/spark/commit/be7ba9b5a9c70519a7fa1b0497955fbba763e2e6).