[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user mallman commented on the issue: https://github.com/apache/spark/pull/16578 I'm closing this PR in favor of #21320. That PR deals with simple projection and filter queries only. I will submit subsequent PRs for aggregation and join queries following the acceptance of #21320. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user mallman commented on the issue: https://github.com/apache/spark/pull/16578
BTW I've been and am currently traveling with a busy itinerary. I haven't started work on this and probably won't get to work on it until Monday at the very earliest.
> On May 5, 2018, at 8:32 AM, Xiao Li wrote:
>
> Yeah. That is fine. Will try to review the relevant PRs ASAP. Please ping me. Thanks again!
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/16578 Yeah. That is fine. Will try to review the relevant PRs ASAP. Please ping me. Thanks again!
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user mallman commented on the issue: https://github.com/apache/spark/pull/16578
> To ensure the PR and review quality, we normally avoid doing everything in a single huge PR. It would be much better if you can cut it to a few smaller PRs.

I'll have a go at it. Of course this will rewrite most of the commits, but I assume you don't mind that.
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/16578 To ensure the PR and review quality, we normally avoid doing everything in a single huge PR. It would be much better if you can cut it to a few smaller PRs. Both @cloud-fan and I think separating the optimizer rules makes sense. WDYT?
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16578 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89794/ Test PASSed.
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16578 Merged build finished. Test PASSed.
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16578 **[Test build #89794 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89794/testReport)** for PR 16578 at commit [`dd4f2d8`](https://github.com/apache/spark/commit/dd4f2d8829335b9d9e71fead6d0d056d48a9d7e6). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class AggregateFieldExtractionPushdownSuite extends SchemaPruningTest ` * `class JoinFieldExtractionPushdownSuite extends SchemaPruningTest `
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16578 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2636/ Test PASSed.
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16578 Merged build finished. Test PASSed.
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16578 **[Test build #89794 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89794/testReport)** for PR 16578 at commit [`dd4f2d8`](https://github.com/apache/spark/commit/dd4f2d8829335b9d9e71fead6d0d056d48a9d7e6).
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/16578
I only looked at the PR description; here are my 2 cents.

Currently, column pruning is done in 2 steps in Spark:
1) The optimizer generates an extra `Project` to prune unnecessary columns as low in the plan as possible, to reduce the data size between operators.
2) The planner extracts the required columns and pushes them to the data source.

The first step is generally useful even if the data source doesn't support column pruning, because we can reduce the data size between operators (e.g. shuffle). I think that's also true for nested column pruning.

We can implement nested pruning with 2 PRs:
1. Improve the current column pruning rule (or add a new rule) to prune nested columns as low in the plan as possible.
2. Improve the planner rule to extract the required nested columns and push them to Parquet.
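The idea of pruning nested columns down to only the required fields can be illustrated with a minimal sketch. This is not Spark's optimizer or planner code: the dict-based `Schema` type, the `prune_schema` helper, and the `contacts` example are all hypothetical stand-ins, assuming a schema is a nested mapping and required fields are given as dotted paths.

```python
import copy
from typing import Dict, Iterable, List, Union

# Hypothetical stand-in for a nested schema (not Spark's StructType):
# leaf fields map to a type name, struct fields map to another dict.
Schema = Dict[str, Union[str, "Schema"]]


def prune_schema(schema: Schema, required_paths: Iterable[str]) -> Schema:
    """Keep only the fields named by dotted paths such as 'name.first'."""
    # Group the required sub-paths by top-level field.
    # An empty sub-path means the whole field is required.
    wanted: Dict[str, List[str]] = {}
    for path in required_paths:
        head, _, rest = path.partition(".")
        wanted.setdefault(head, []).append(rest)

    pruned: Schema = {}
    for name, field in schema.items():  # preserves the original field order
        if name not in wanted:
            continue  # nobody asked for this field: prune it
        subpaths = wanted[name]
        if isinstance(field, dict) and "" not in subpaths:
            # Only some nested fields are needed: recurse into the struct.
            pruned[name] = prune_schema(field, subpaths)
        else:
            # Whole field requested (or the field is a leaf): keep it as is.
            pruned[name] = copy.deepcopy(field)
    return pruned


contacts: Schema = {
    "id": "long",
    "name": {"first": "string", "middle": "string", "last": "string"},
    "address": {"city": "string", "zip": "string"},
}

# A query touching only name.first and the whole address struct:
print(prune_schema(contacts, ["name.first", "address"]))
# {'name': {'first': 'string'}, 'address': {'city': 'string', 'zip': 'string'}}
```

In this sketch the same pruned schema could serve both steps described above: narrowing what flows between operators, and describing what is requested from the data source.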
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/16578 I will review this huge PR. : )
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/16578 hi - where are we on this?
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/16578 Please do the review, @gengliangwang @jiangxb1987. We should support this feature in Spark 2.4.0.
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user zaycev commented on the issue: https://github.com/apache/spark/pull/16578
I observed about 5x better performance in reading a small subset of fields of a highly nested parquet table:
master: https://user-images.githubusercontent.com/283938/36928047-e07e5b52-1e36-11e8-98e4-a614ad7589b6.png https://user-images.githubusercontent.com/283938/36928033-c9a21022-1e36-11e8-81bf-7008e1f40d6f.png
master with @mallman patch: https://user-images.githubusercontent.com/283938/36928037-cdc9ec10-1e36-11e8-8830-5e77c074e4ab.png https://user-images.githubusercontent.com/283938/36928048-e3e15a88-1e36-11e8-8dda-9b384c4a04c8.png
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user Gauravshah commented on the issue: https://github.com/apache/spark/pull/16578 We have back-ported it to 2.2; in production it has, on average, saved us at least 2x in time.
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user Gauravshah commented on the issue: https://github.com/apache/spark/pull/16578 @marmbrus can we target it for 2.4? We need help with reviews. It has been in a waiting state for a very long time.
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16578 Merged build finished. Test PASSed.
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16578 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87859/ Test PASSed.
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16578 **[Test build #87859 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87859/testReport)** for PR 16578 at commit [`27737a0`](https://github.com/apache/spark/commit/27737a07ea39e953add1fab74d877c6543206a29). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class AggregateFieldExtractionPushdownSuite extends SchemaPruningTest ` * `class JoinFieldExtractionPushdownSuite extends SchemaPruningTest `
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16578 **[Test build #87859 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87859/testReport)** for PR 16578 at commit [`27737a0`](https://github.com/apache/spark/commit/27737a07ea39e953add1fab74d877c6543206a29).
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16578 Merged build finished. Test PASSed.
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16578 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1209/ Test PASSed.
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user DaimonPl commented on the issue: https://github.com/apache/spark/pull/16578 So if it's not going to be included in `2.3.0`, maybe we could change `spark.sql.nestedSchemaPruning.enabled` to default to `true`? I hope that this time this PR could be finalized at the early stage of `2.4.0` so there would be plenty of time to fix any unforeseen problems.
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user mallman commented on the issue: https://github.com/apache/spark/pull/16578
I'd just suggest trying it. Since this PR is a patch for master, please message me personally at m...@allman.ms to discuss progress and questions on a backport to 2.2. If we get it working, we can post back here with a link to a fork. Thanks for taking this on! Michael
On Mon, 8 Jan 2018, Gaurav M Shah wrote:
> @mallman do you foresee any issues ? planning to backport it to spark 2.2 on personal fork. will probably make jitpack release
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user Gauravshah commented on the issue: https://github.com/apache/spark/pull/16578 @mallman do you foresee any issues ? planning to backport it to spark 2.2 on personal fork. will probably make jitpack release
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16578 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85662/ Test FAILed.
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16578 **[Test build #85662 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85662/testReport)** for PR 16578 at commit [`42067c7`](https://github.com/apache/spark/commit/42067c7c91c3fc72d57050d501bf39f1fd777bae). * This patch **fails due to an unknown error code, -9**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class AggregateFieldExtractionPushdownSuite extends SchemaPruningTest ` * `class JoinFieldExtractionPushdownSuite extends SchemaPruningTest `
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16578 Merged build finished. Test FAILed.
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16578 **[Test build #85662 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85662/testReport)** for PR 16578 at commit [`42067c7`](https://github.com/apache/spark/commit/42067c7c91c3fc72d57050d501bf39f1fd777bae).
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user VigneshMohan1 commented on the issue: https://github.com/apache/spark/pull/16578 @JoshRosen Can we get this PR into 2.3.0? A lot of people are interested in this, and it will boost performance when reading nested Parquet fields.
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user Gauravshah commented on the issue: https://github.com/apache/spark/pull/16578 @marmbrus can we start the review process ? so that it can make it for the next release ?
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user mallman commented on the issue: https://github.com/apache/spark/pull/16578
> However, I am -1 on merging a change this large after branch cut.

It's disappointing, but I agree we can't merge a change this large after the branch cut. It will have to wait for 2.3.1 at the earliest or the next major release.
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user marmbrus commented on the issue: https://github.com/apache/spark/pull/16578 I agree that this PR needs to be allocated more review bandwidth, and it is unfortunate that it has been blocked on that. However, I am -1 on merging a change this large after branch cut.
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user ianoc commented on the issue: https://github.com/apache/spark/pull/16578 Given it has one or two deep reviews already, can someone just rubber stamp this in a bias to shipping? It's been stalled more or less since July waiting on reviewers.
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/16578 We are still merging changes to the 2.3 branch :)
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user Gauravshah commented on the issue: https://github.com/apache/spark/pull/16578 @DaimonPl branch 2.3 is already cut, so it's at least not making it into 2.3 :(
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user DaimonPl commented on the issue: https://github.com/apache/spark/pull/16578 New year, new review? ;)
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user Gauravshah commented on the issue: https://github.com/apache/spark/pull/16578 Thanks @mallman for rebasing each time. @gatorsmile can you take a look at it?
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16578 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84820/ Test PASSed.
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16578 Merged build finished. Test PASSed.
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16578 **[Test build #84820 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84820/testReport)** for PR 16578 at commit [`1936c9b`](https://github.com/apache/spark/commit/1936c9b2e4cf4008e5ee7282c6371fc0ca0535bb). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class AggregateFieldExtractionPushdownSuite extends SchemaPruningTest ` * `class JoinFieldExtractionPushdownSuite extends SchemaPruningTest `
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16578 **[Test build #84820 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84820/testReport)** for PR 16578 at commit [`1936c9b`](https://github.com/apache/spark/commit/1936c9b2e4cf4008e5ee7282c6371fc0ca0535bb).
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16578 Merged build finished. Test FAILed.
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16578 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84812/ Test FAILed.
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16578 **[Test build #84812 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84812/testReport)** for PR 16578 at commit [`a3d5d9b`](https://github.com/apache/spark/commit/a3d5d9b1c9fa5746016ffc7d2e88eda921503f4f). * This patch **fails Scala style tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class AggregateFieldExtractionPushdownSuite extends SchemaPruningTest ` * `class JoinFieldExtractionPushdownSuite extends SchemaPruningTest `
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16578 **[Test build #84812 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84812/testReport)** for PR 16578 at commit [`a3d5d9b`](https://github.com/apache/spark/commit/a3d5d9b1c9fa5746016ffc7d2e88eda921503f4f).
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user abhaynahar commented on the issue: https://github.com/apache/spark/pull/16578 Sorry for spamming, but @rxin @marmbrus @ericl @cloud-fan @liancheng can you please help take this forward? @viirya has reviewed it closely and is looking for someone else to review this as well. This patch has a lot of speed improvements and we are hoping it can make it into the 2.3 release.
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user viirya commented on the issue: https://github.com/apache/spark/pull/16578 @abhaynahar I think the reviewers are already included...
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user abhaynahar commented on the issue: https://github.com/apache/spark/pull/16578 @viirya can you please help tag people you think should review ?
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user viirya commented on the issue: https://github.com/apache/spark/pull/16578 As I mentioned before, we still don't have enough eyes on this change so far.
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/16578 yes! sorry about the delay, I think there's a lot of interest in this PR. @gatorsmile @viirya ?
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16578 Merged build finished. Test PASSed.
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16578 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84404/ Test PASSed.
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16578 **[Test build #84404 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84404/testReport)** for PR 16578 at commit [`981e53b`](https://github.com/apache/spark/commit/981e53bf6b4b23f790ff0bbd457f54f308441076). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class AggregateFieldExtractionPushdownSuite extends SchemaPruningTest ` * `class JoinFieldExtractionPushdownSuite extends SchemaPruningTest `
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16578 **[Test build #84404 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84404/testReport)** for PR 16578 at commit [`981e53b`](https://github.com/apache/spark/commit/981e53bf6b4b23f790ff0bbd457f54f308441076).
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user sriramrajendiran commented on the issue: https://github.com/apache/spark/pull/16578 @felixcheung can you help? We are hoping to see it in the 2.3 release. A feature behind a flag that is disabled by default looks like a safe option.
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user mallman commented on the issue: https://github.com/apache/spark/pull/16578
> But I think we still need other eyes on this too.

Agreed. @rxin can you help rope anyone else in on this? It's a big PR with a bigger history, but absent some savaging by another reviewer I believe it is close to the finish line. A lot of people are hoping this can make it into Spark 2.3. It has a huge performance impact for some Spark users, as evidenced by comments on this PR and VideoAmp's own experience. So I'm hoping we can get this merged to master before the 2.3 branch is cut. Thanks!
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user mallman commented on the issue: https://github.com/apache/spark/pull/16578
> Can you give an example it would fail? We didn't change clipParquetSchema, so I think even when pruning happens, why we read a super set of the file's schema and cause the exception, according to the comment? We won't add new fields but remove existing from the file's schema, right?

(Oddly, Github won't let me reply to this comment in line.)

The situation we've run into is pruning a schema for a query over a partitioned Hive table backed by Parquet files where some files are missing fields specified by the table schema. This can happen, e.g., in schema evolution where fields are added to the table over time without rewriting existing partitions. In those cases, we've found parquet-mr throws an exception if we try to read from such a file with the table-pruned schema (a superset of that file's schema). Therefore, we further clip the pruned schema against each file's schema before attempting to read.
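The per-file clipping step described above can be sketched as a recursive intersection of two schemas. This is an illustration only, not the PR's code or parquet-mr's API: the dict-based schemas, the `clip_to_file_schema` helper, and the field names are hypothetical, and type mismatches between the two schemas are deliberately not handled.

```python
from typing import Dict, Union

# Hypothetical stand-in for a nested schema (not Parquet's MessageType):
# leaf fields map to a type name, struct fields map to another dict.
Schema = Dict[str, Union[str, "Schema"]]


def clip_to_file_schema(pruned: Schema, file_schema: Schema) -> Schema:
    """Drop every requested field that the file does not actually contain."""
    clipped: Schema = {}
    for name, field in pruned.items():
        if name not in file_schema:
            # The file predates this field (schema evolution): do not request it.
            continue
        file_field = file_schema[name]
        if isinstance(field, dict) and isinstance(file_field, dict):
            # Both sides are structs: clip the nested fields as well.
            clipped[name] = clip_to_file_schema(field, file_field)
        else:
            # Leaf field present in the file (type mismatches are not checked here).
            clipped[name] = field
    return clipped


# Schema pruned against the table: the query needs name.first, name.middle and email.
table_pruned: Schema = {
    "name": {"first": "string", "middle": "string"},
    "email": "string",
}

# An older partition file written before `name.middle` and `email` were added.
old_file: Schema = {
    "id": "long",
    "name": {"first": "string", "last": "string"},
}

print(clip_to_file_schema(table_pruned, old_file))
# {'name': {'first': 'string'}}
```

With this shape, the schema requested from each file is never a superset of what that file holds, which is the condition the comment identifies as triggering the exception.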
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16578 Merged build finished. Test PASSed.
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16578 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84041/ Test PASSed.
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16578 **[Test build #84041 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84041/testReport)** for PR 16578 at commit [`75971ae`](https://github.com/apache/spark/commit/75971aed0cec9aa2e0e26b593eb9c01164303f1f). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class AggregateFieldExtractionPushdownSuite extends SchemaPruningTest ` * `class JoinFieldExtractionPushdownSuite extends SchemaPruningTest `
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16578 **[Test build #84041 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84041/testReport)** for PR 16578 at commit [`75971ae`](https://github.com/apache/spark/commit/75971aed0cec9aa2e0e26b593eb9c01164303f1f).
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user viirya commented on the issue: https://github.com/apache/spark/pull/16578 I'm going through this again, but I think we still need other eyes on it too.
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16578 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83532/ Test PASSed.
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16578 Merged build finished. Test PASSed.
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16578 **[Test build #83532 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83532/testReport)** for PR 16578 at commit [`48a509e`](https://github.com/apache/spark/commit/48a509e8602ed44a4a0fd5268d91d917bb8e0748). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16578 **[Test build #83532 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83532/testReport)** for PR 16578 at commit [`48a509e`](https://github.com/apache/spark/commit/48a509e8602ed44a4a0fd5268d91d917bb8e0748).
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user viirya commented on the issue: https://github.com/apache/spark/pull/16578 retest this please.
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user Swat123 commented on the issue: https://github.com/apache/spark/pull/16578 @viirya can we close this before we get another set of merge conflicts? Thanks
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user mallman commented on the issue: https://github.com/apache/spark/pull/16578 @viirya Can you please take a look at my latest revisions and replies to your comments? Cheers.
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16578 Merged build finished. Test PASSed.
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16578 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83387/ Test PASSed.
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16578 **[Test build #83387 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83387/testReport)** for PR 16578 at commit [`71c1762`](https://github.com/apache/spark/commit/71c17622261c69503c2a2bd80c769bd664df3d9d). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user mallman commented on the issue: https://github.com/apache/spark/pull/16578 I can't tell what's causing the build to fail: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83390/console Any ideas?
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16578 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83390/ Test FAILed.
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16578 Merged build finished. Test FAILed.
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user mallman commented on the issue: https://github.com/apache/spark/pull/16578

> Yeah, I think with a config for this optimization is good.

I added a config switch, `spark.sql.nestedSchemaPruning.enabled`, which disables the optimizations if set to `false`. By default it's `true`.
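A minimal sketch of toggling the switch from `spark-shell`, where the `spark` session is predefined. It assumes a Spark build that includes this patch; the key is copied verbatim from the comment above, and its name in any released Spark version may differ.

```scala
// Assumption: a build containing this patch; the key name is taken from the comment above.
// Disable the nested schema pruning optimization for the current session.
spark.conf.set("spark.sql.nestedSchemaPruning.enabled", "false")

// Re-enable it (the default, per the comment above).
spark.conf.set("spark.sql.nestedSchemaPruning.enabled", "true")
```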
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16578 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83389/ Test FAILed.
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16578 Build finished. Test FAILed.
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16578 **[Test build #83387 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83387/testReport)** for PR 16578 at commit [`71c1762`](https://github.com/apache/spark/commit/71c17622261c69503c2a2bd80c769bd664df3d9d).
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user viirya commented on the issue: https://github.com/apache/spark/pull/16578 Yeah, I think with a config for this optimization is good.
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/16578 Thanks @CodingCat +1 on config switch. I think that would be a good idea.
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user CodingCat commented on the issue: https://github.com/apache/spark/pull/16578

I made a simple test in a single-node Spark environment. I used a synthetic dataset (that's 20M), generated as:

```scala
import spark.implicits._
import org.apache.spark.{SparkContext, TaskContext}
import org.apache.spark.sql.SaveMode // needed for SaveMode.Overwrite below

// Person carries a nested struct column, `job`, with two string fields.
case class Job(title: String, department: String)
case class Person(id: Int, name: String, job: Job)
(0 until 2000).map(id => Person(id, id.toString, Job(id.toString, id.toString))).toDF.write.mode(SaveMode.Overwrite).parquet("/home/zhunan/parquet_test")
```

I then read the directory and write to another place with:

```scala
val df = spark.read.parquet("/home/zhunan/parquet_test")
// Select only one nested field, so only job.title should need to be read.
df.select("job.title").write.mode(SaveMode.Overwrite).parquet("/home/zhunan/parquet_out")
```

Without the patch, it reads 169 MB; with the patch, it reads around 86 MB. Basically, this shows that the PR is working.
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user mallman commented on the issue: https://github.com/apache/spark/pull/16578

> I'm reluctant to generalize this PR without practical experience applying it to other column-oriented file formats. The only format I'm familiar with and have production experience with is Parquet.

I want to expand on what I wrote. A lot of this patch is already generalized, e.g. the catalyst code. The tricky part is porting and testing the file access code. While columnar formats operate under the same principles, the devil is in the details, so to speak. Hence my reluctance to sign off on a broad generalization of this patch to other file formats.

BTW, one thing that's occurred to me is the possibility of putting this functionality behind a configuration setting for the first one or two releases in which it exists. In the case of a bug we've overlooked, the end user can disable the optimization. What do you think?
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user mallman commented on the issue: https://github.com/apache/spark/pull/16578

> @mallman I will try to go through this again. Do you think this can be generalized to data source v2 API?

I'm not familiar with that API. I'm reluctant to generalize this PR without practical experience applying it to other column-oriented file formats. The only format I'm familiar with and have production experience with is Parquet.
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user viirya commented on the issue: https://github.com/apache/spark/pull/16578 @mallman I will try to go through this again. Do you think this can be generalized to the data source v2 API?
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/16578 thanks! ping/add @rxin @hvanhovell @gatorsmile @cloud-fan @liancheng @joseph-torres
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user bitcot commented on the issue: https://github.com/apache/spark/pull/16578 Thanks @mallman, this is very helpful. @felixcheung @rxin can you please help take this forward?
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user mallman commented on the issue: https://github.com/apache/spark/pull/16578 @viirya I've rebased to resolve conflicts. All tests are passing. Can you take another look and sign off? Cheers.
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16578 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83128/ Test PASSed.
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16578 Merged build finished. Test PASSed.
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16578 **[Test build #83128 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83128/testReport)** for PR 16578 at commit [`7af925b`](https://github.com/apache/spark/commit/7af925bcf44861119ec305ab9631a3511d8e8bbb). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16578 **[Test build #83128 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83128/testReport)** for PR 16578 at commit [`7af925b`](https://github.com/apache/spark/commit/7af925bcf44861119ec305ab9631a3511d8e8bbb).
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user mallman commented on the issue: https://github.com/apache/spark/pull/16578 @DaimonPl I'm going to resolve the merge conflicts shortly. Otherwise, I have no intention of making further modifications to this PR outside of further review.
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user DaimonPl commented on the issue: https://github.com/apache/spark/pull/16578 @mallman how about finalizing it as is? IMHO the performance improvements are worth more than a (possibly) redundant workaround; it could be cleaned up later.
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user amankothari04 commented on the issue: https://github.com/apache/spark/pull/16578 @viirya did you get a chance to review this?
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user DaimonPl commented on the issue: https://github.com/apache/spark/pull/16578

@mallman @viirya from my understanding, the current workaround is for the case of reading columns that are not in the file schema:

> Parquet-mr will throw an exception if we try to read a superset of the file's schema.

Isn't it somehow dependent on the schema evolution setting? http://spark.apache.org/docs/latest/sql-programming-guide.html#schema-merging

> Since schema merging is a relatively expensive operation, and is not a necessity in most cases, we turned it off by default starting from 1.5.0. You may enable it by
> * setting data source option mergeSchema to true when reading Parquet files (as shown in the examples below), or
> * setting the global SQL option spark.sql.parquet.mergeSchema to true.

Wouldn't it work fine with `spark.sql.parquet.mergeSchema` enabled?
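For reference, the two ways of enabling schema merging quoted above look roughly like this in `spark-shell`, where `spark` is predefined. The path is a hypothetical placeholder. The sketch only shows how the setting is turned on; it does not show whether it removes the need for the workaround.

```scala
// Option 1: per-read data source option (the path is a hypothetical placeholder).
val merged = spark.read
  .option("mergeSchema", "true")
  .parquet("/path/to/partitioned_table")

// Option 2: global SQL option, applied to subsequent Parquet reads in this session.
spark.conf.set("spark.sql.parquet.mergeSchema", "true")
```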
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16578 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82383/ Test PASSed.