[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2018-05-14 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/16578
  
I'm closing this PR in favor of #21320. That PR deals with simple 
projection and filter queries only. I will submit subsequent PRs for 
aggregation and join queries following the acceptance of #21320.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2018-05-10 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/16578
  
BTW I’ve been and am currently traveling with a busy itinerary. I 
haven’t started work on this and probably won’t get to work on it until 
Monday at the very earliest.

> On May 5, 2018, at 8:32 AM, Xiao Li  wrote:
> 
> Yeah. That is fine. Will try to review the relevant PRs ASAP. Please ping
> me. Thanks again!



---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2018-05-04 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/16578
  
Yeah. That is fine. Will try to review the relevant PRs ASAP. Please ping 
me. Thanks again!


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2018-05-04 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/16578
  
> To ensure the PR and review quality, we normally avoid doing everything 
in a single huge PR. It would be much better if you can cut it to a few smaller 
PRs.

I'll have a go at it. Of course this will rewrite most of the commits, but 
I assume you don't mind that.


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2018-05-01 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/16578
  
To ensure the PR and review quality, we normally avoid doing everything in 
a single huge PR. It would be much better if you can cut it to a few smaller 
PRs. Both @cloud-fan and I think separating the optimizer rules makes sense. 
WDYT?


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2018-04-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16578
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89794/
Test PASSed.


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2018-04-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16578
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2018-04-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16578
  
**[Test build #89794 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89794/testReport)**
 for PR 16578 at commit 
[`dd4f2d8`](https://github.com/apache/spark/commit/dd4f2d8829335b9d9e71fead6d0d056d48a9d7e6).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class AggregateFieldExtractionPushdownSuite extends SchemaPruningTest `
  * `class JoinFieldExtractionPushdownSuite extends SchemaPruningTest `


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2018-04-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16578
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2636/
Test PASSed.


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2018-04-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16578
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2018-04-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16578
  
**[Test build #89794 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89794/testReport)**
 for PR 16578 at commit 
[`dd4f2d8`](https://github.com/apache/spark/commit/dd4f2d8829335b9d9e71fead6d0d056d48a9d7e6).


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2018-04-18 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/16578
  
I only looked at the PR description, here are my 2 cents:

Currently column pruning is done in 2 steps in Spark: 1) the optimizer 
generates extra `Project` operators to prune unnecessary columns as low in the 
plan as possible, to reduce the data size between operators. 2) the planner 
extracts the required columns and pushes them to data sources. 

The first step is generally useful even if the data source doesn't support 
column pruning, because we can reduce data size between operators (e.g. 
shuffle). I think it's also true for nested column pruning.

We can implement nested pruning with 2 PRs:
1. improve the current column pruning rule (or add a new rule) to prune 
nested columns as low in the plan as possible
2. improve the planner rule to extract the required nested columns and push 
them to Parquet.
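
To make "extract the required nested columns" concrete, here is a minimal sketch in plain Python. It is not Spark code and not what this PR implements: the nested-dict schema model, the dotted field paths, and the name `prune_schema` are all assumptions made for illustration only.

```python
import copy


def prune_schema(schema, required_paths):
    """Return a copy of `schema` restricted to the dotted `required_paths`.

    Hypothetical model: a struct is a dict, a leaf is a type-name string.
    """
    # Group the remainder of each path by its first component, e.g.
    # "contact.name.first" is filed under "contact" as "name.first".
    suffixes_by_head = {}
    for path in required_paths:
        head, _, rest = path.partition(".")
        suffixes_by_head.setdefault(head, []).append(rest)

    pruned = {}
    for name, field in schema.items():  # keeps the original field order
        if name not in suffixes_by_head:
            continue  # nothing in the query touches this field
        suffixes = suffixes_by_head[name]
        if isinstance(field, dict) and all(suffixes):
            # Every request reaches deeper into this struct, so recurse.
            pruned[name] = prune_schema(field, suffixes)
        else:
            # A leaf, or at least one request needs the whole struct.
            pruned[name] = copy.deepcopy(field)
    return pruned


# Made-up table schema, for illustration only.
table_schema = {
    "id": "long",
    "contact": {
        "name": {"first": "string", "last": "string"},
        "address": "string",
        "phone": "string",
    },
    "payload": "binary",
}

# A query that only reads `id` and `contact.name.first`.
print(prune_schema(table_schema, ["id", "contact.name.first"]))
```

In this toy model the pruned schema keeps `contact` only as a path down to `name.first`, so a reader that honours it would skip `address`, `phone`, and `payload` entirely.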


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2018-04-10 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/16578
  
I will review this huge PR. : )


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2018-04-10 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/16578
  
hi - where are we on this?


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2018-03-12 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/16578
  
Please do the review @gengliangwang @jiangxb1987 . We should support this 
feature in Spark 2.4.0


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2018-03-02 Thread zaycev
Github user zaycev commented on the issue:

https://github.com/apache/spark/pull/16578
  
I observed about 5x better performance in reading a small subset of fields 
of a highly nested parquet table:

master:
https://user-images.githubusercontent.com/283938/36928047-e07e5b52-1e36-11e8-98e4-a614ad7589b6.png
https://user-images.githubusercontent.com/283938/36928033-c9a21022-1e36-11e8-81bf-7008e1f40d6f.png

master with @mallman patch:
https://user-images.githubusercontent.com/283938/36928037-cdc9ec10-1e36-11e8-8830-5e77c074e4ab.png
https://user-images.githubusercontent.com/283938/36928048-e3e15a88-1e36-11e8-8dda-9b384c4a04c8.png






---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2018-03-01 Thread Gauravshah
Github user Gauravshah commented on the issue:

https://github.com/apache/spark/pull/16578
  
We have back-ported it to 2.2. In production it has, on average, saved us 
at least 2x in time.


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2018-03-01 Thread Gauravshah
Github user Gauravshah commented on the issue:

https://github.com/apache/spark/pull/16578
  
@marmbrus can we target it for 2.4? We need help with reviews. It has been 
in a waiting state for a very long time.


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2018-03-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16578
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2018-03-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16578
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87859/
Test PASSed.


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2018-03-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16578
  
**[Test build #87859 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87859/testReport)**
 for PR 16578 at commit 
[`27737a0`](https://github.com/apache/spark/commit/27737a07ea39e953add1fab74d877c6543206a29).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class AggregateFieldExtractionPushdownSuite extends SchemaPruningTest `
  * `class JoinFieldExtractionPushdownSuite extends SchemaPruningTest `


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2018-03-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16578
  
**[Test build #87859 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87859/testReport)**
 for PR 16578 at commit 
[`27737a0`](https://github.com/apache/spark/commit/27737a07ea39e953add1fab74d877c6543206a29).


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2018-03-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16578
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2018-03-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16578
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1209/
Test PASSed.


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2018-02-05 Thread DaimonPl
Github user DaimonPl commented on the issue:

https://github.com/apache/spark/pull/16578
  
So if it's not going to be included in `2.3.0`, maybe we could change 
`spark.sql.nestedSchemaPruning.enabled` to default to `true`? I hope that this 
time this PR can be finalized at an early stage of `2.4.0`, so there will be 
plenty of time to fix any unforeseen problems.


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2018-01-09 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/16578
  
I'd just suggest trying it. Since this PR is a patch for master, please 
message me personally at m...@allman.ms to discuss progress and questions 
on a backport to 2.2. If we get it working, we can post back here with a 
link to a fork.

Thanks for taking this on!

Michael

On Mon, 8 Jan 2018, Gaurav M Shah wrote:

> 
> @mallman do you foresee any issues ? planning to backport it to spark 2.2 on
> personal fork. will probably make jitpack release


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2018-01-08 Thread Gauravshah
Github user Gauravshah commented on the issue:

https://github.com/apache/spark/pull/16578
  
@mallman do you foresee any issues? I'm planning to backport it to Spark 2.2 
on a personal fork, and will probably make a JitPack release.


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2018-01-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16578
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85662/
Test FAILed.


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2018-01-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16578
  
**[Test build #85662 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85662/testReport)**
 for PR 16578 at commit 
[`42067c7`](https://github.com/apache/spark/commit/42067c7c91c3fc72d57050d501bf39f1fd777bae).
 * This patch **fails due to an unknown error code, -9**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class AggregateFieldExtractionPushdownSuite extends SchemaPruningTest `
  * `class JoinFieldExtractionPushdownSuite extends SchemaPruningTest `


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2018-01-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16578
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2018-01-03 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16578
  
**[Test build #85662 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85662/testReport)**
 for PR 16578 at commit 
[`42067c7`](https://github.com/apache/spark/commit/42067c7c91c3fc72d57050d501bf39f1fd777bae).


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2018-01-03 Thread VigneshMohan1
Github user VigneshMohan1 commented on the issue:

https://github.com/apache/spark/pull/16578
  
@JoshRosen Can we get this PR into 2.3.0? A lot of people are interested in 
this, and it will boost performance when reading nested Parquet fields.


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2018-01-03 Thread Gauravshah
Github user Gauravshah commented on the issue:

https://github.com/apache/spark/pull/16578
  
@marmbrus can we start the review process, so that it can make it into the 
next release?


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2018-01-03 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/16578
  
> However, I am -1 on merging a change this large after branch cut.

It's disappointing, but I agree we can't merge a change this large after 
the branch cut. It will have to wait for 2.3.1 at the earliest or the next major 
release.


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2018-01-03 Thread marmbrus
Github user marmbrus commented on the issue:

https://github.com/apache/spark/pull/16578
  
I agree that this PR needs to be allocated more review bandwidth, and it is 
unfortunate that it has been blocked on that. However, I am -1 on merging a 
change this large after branch cut.


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2018-01-03 Thread ianoc
Github user ianoc commented on the issue:

https://github.com/apache/spark/pull/16578
  
Given it has one or two deep reviews already, can someone just rubber-stamp 
this, with a bias toward shipping? It's been stalled more or less since July 
waiting on reviewers. 


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2018-01-03 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/16578
  
We are still merging changes to the 2.3 branch :)



---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2018-01-03 Thread Gauravshah
Github user Gauravshah commented on the issue:

https://github.com/apache/spark/pull/16578
  
@DaimonPl branch 2.3 is already cut, so it's not making it into 2.3 at least :(


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2018-01-02 Thread DaimonPl
Github user DaimonPl commented on the issue:

https://github.com/apache/spark/pull/16578
  
New year, new review? ;)


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-12-13 Thread Gauravshah
Github user Gauravshah commented on the issue:

https://github.com/apache/spark/pull/16578
  
Thanks @mallman for rebasing each time. @gatorsmile can you take a look at 
it?


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-12-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16578
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84820/
Test PASSed.


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-12-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16578
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-12-12 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16578
  
**[Test build #84820 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84820/testReport)**
 for PR 16578 at commit 
[`1936c9b`](https://github.com/apache/spark/commit/1936c9b2e4cf4008e5ee7282c6371fc0ca0535bb).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class AggregateFieldExtractionPushdownSuite extends SchemaPruningTest `
  * `class JoinFieldExtractionPushdownSuite extends SchemaPruningTest `


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-12-12 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16578
  
**[Test build #84820 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84820/testReport)**
 for PR 16578 at commit 
[`1936c9b`](https://github.com/apache/spark/commit/1936c9b2e4cf4008e5ee7282c6371fc0ca0535bb).


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-12-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16578
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-12-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16578
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84812/
Test FAILed.


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-12-12 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16578
  
**[Test build #84812 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84812/testReport)**
 for PR 16578 at commit 
[`a3d5d9b`](https://github.com/apache/spark/commit/a3d5d9b1c9fa5746016ffc7d2e88eda921503f4f).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class AggregateFieldExtractionPushdownSuite extends SchemaPruningTest `
  * `class JoinFieldExtractionPushdownSuite extends SchemaPruningTest `


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-12-12 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16578
  
**[Test build #84812 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84812/testReport)**
 for PR 16578 at commit 
[`a3d5d9b`](https://github.com/apache/spark/commit/a3d5d9b1c9fa5746016ffc7d2e88eda921503f4f).


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-12-12 Thread abhaynahar
Github user abhaynahar commented on the issue:

https://github.com/apache/spark/pull/16578
  
Sorry for spamming, but @rxin @marmbrus @ericl @cloud-fan @liancheng can 
you please help take this forward? @viirya has reviewed it closely and is 
looking for someone else to review it as well. This patch has a lot of speed 
improvements and we are hoping it can make it into the 2.3 release.


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-12-11 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/16578
  
@abhaynahar I think the reviewers are already included...


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-12-11 Thread abhaynahar
Github user abhaynahar commented on the issue:

https://github.com/apache/spark/pull/16578
  
@viirya can you please help tag people you think should review ?


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-12-10 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/16578
  
As I mentioned before, we still don't have enough eyes on this change so 
far. 


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-12-09 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/16578
  
yes!
Sorry about the delay; I think there's a lot of interest in this PR.
@gatorsmile @viirya ?


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-12-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16578
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-12-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16578
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84404/
Test PASSed.


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-12-03 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16578
  
**[Test build #84404 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84404/testReport)**
 for PR 16578 at commit 
[`981e53b`](https://github.com/apache/spark/commit/981e53bf6b4b23f790ff0bbd457f54f308441076).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class AggregateFieldExtractionPushdownSuite extends SchemaPruningTest `
  * `class JoinFieldExtractionPushdownSuite extends SchemaPruningTest `


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-12-03 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16578
  
**[Test build #84404 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84404/testReport)**
 for PR 16578 at commit 
[`981e53b`](https://github.com/apache/spark/commit/981e53bf6b4b23f790ff0bbd457f54f308441076).


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-11-26 Thread sriramrajendiran
Github user sriramrajendiran commented on the issue:

https://github.com/apache/spark/pull/16578
  
@felixcheung can you help? We are hoping to see it in the 2.3 release. Shipping 
the feature behind a flag that is disabled by default looks like a safe option.


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-11-21 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/16578
  
> But I think we still need other eyes on this too.

Agreed.

@rxin can you help rope anyone else in on this? It's a big PR with a bigger 
history, but absent some savaging by another reviewer I believe it is close to 
the finish line.

A lot of people are hoping this can make it into Spark 2.3. It has a huge 
performance impact for some Spark users, as evidenced by comments on this PR 
and VideoAmp's own experience. So I'm hoping we can get this merged to master 
before the 2.3 branch is cut.

Thanks!


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-11-21 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/16578
  
> Can you give an example where it would fail? We didn't change 
clipParquetSchema, so even when pruning happens, why would we read a superset 
of the file's schema and cause the exception described in the comment? We 
won't add new fields, only remove existing ones from the file's schema, right?

(Oddly, GitHub won't let me reply to this comment inline.)

The situation we've run into is pruning a schema for a query over a 
partitioned Hive table backed by Parquet files where some files are missing 
fields specified by the table schema. This can happen, e.g., in schema 
evolution where fields are added to the table over time without rewriting 
existing partitions. In those cases, we've found parquet-mr throws an exception 
if we try to read from that file with the table-pruned schema (a superset of 
that file's schema). Therefore, we further clip the pruned schema against each 
file's schema before attempting to read.
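
To make the clipping step concrete, here is a minimal sketch. It uses a toy schema model rather than Spark's or parquet-mr's schema classes, and every name in it (`FieldType`, `Struct`, `SchemaClipping`, the example field names) is hypothetical, not taken from the PR. It only illustrates the idea of dropping requested fields that a given file does not contain.

```scala
// Toy schema model for illustration only; these are not Spark's StructType
// or parquet-mr's MessageType.
sealed trait FieldType
case object Leaf extends FieldType
final case class Struct(fields: Map[String, FieldType]) extends FieldType

object SchemaClipping {
  // Clip one requested field against the same-named field found in the file.
  // Returns None when the shapes disagree (leaf vs. struct).
  private def clipField(requested: FieldType, inFile: FieldType): Option[FieldType] =
    (requested, inFile) match {
      case (r: Struct, f: Struct) => Some(clip(r, f))
      case (Leaf, Leaf)           => Some(Leaf)
      case _                      => None
    }

  // Keep only the requested fields that the file contains, recursing into structs.
  def clip(requested: Struct, file: Struct): Struct = {
    val kept = for {
      (name, reqType) <- requested.fields.toSeq
      fileType        <- file.fields.get(name).toSeq
      clipped         <- clipField(reqType, fileType).toSeq
    } yield name -> clipped
    Struct(kept.toMap)
  }
}

object ClippingExample {
  // Pruned table schema: the query needs name, job.title and job.level,
  // where job.level is assumed to have been added to the table later.
  val prunedTableSchema: Struct = Struct(Map(
    "name" -> Leaf,
    "job"  -> Struct(Map("title" -> Leaf, "level" -> Leaf))
  ))

  // Schema of a file from an older partition, written before job.level existed.
  val oldFileSchema: Struct = Struct(Map(
    "id"   -> Leaf,
    "name" -> Leaf,
    "job"  -> Struct(Map("title" -> Leaf, "department" -> Leaf))
  ))

  // The schema to read from the old file: name and job.title only.
  val readSchema: Struct = SchemaClipping.clip(prunedTableSchema, oldFileSchema)
}
```

The clipped schema is a subset of both inputs, so the reader is never asked for a field the file lacks.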


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-11-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16578
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-11-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16578
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84041/
Test PASSed.


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-11-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16578
  
**[Test build #84041 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84041/testReport)**
 for PR 16578 at commit 
[`75971ae`](https://github.com/apache/spark/commit/75971aed0cec9aa2e0e26b593eb9c01164303f1f).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class AggregateFieldExtractionPushdownSuite extends SchemaPruningTest `
  * `class JoinFieldExtractionPushdownSuite extends SchemaPruningTest `


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-11-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16578
  
**[Test build #84041 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84041/testReport)**
 for PR 16578 at commit 
[`75971ae`](https://github.com/apache/spark/commit/75971aed0cec9aa2e0e26b593eb9c01164303f1f).


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-11-08 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/16578
  
I'm going through this again. But I think we still need other eyes on this too.


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-11-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16578
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83532/
Test PASSed.


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-11-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16578
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-11-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16578
  
**[Test build #83532 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83532/testReport)**
 for PR 16578 at commit 
[`48a509e`](https://github.com/apache/spark/commit/48a509e8602ed44a4a0fd5268d91d917bb8e0748).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-11-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16578
  
**[Test build #83532 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83532/testReport)**
 for PR 16578 at commit 
[`48a509e`](https://github.com/apache/spark/commit/48a509e8602ed44a4a0fd5268d91d917bb8e0748).


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-11-06 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/16578
  
retest this please.


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-11-06 Thread Swat123
Github user Swat123 commented on the issue:

https://github.com/apache/spark/pull/16578
  
@viirya can we close this out before we get another set of merge conflicts? 
Thanks


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-11-03 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/16578
  
@viirya Can you please take a look at my latest revisions and replies to 
your comments? Cheers.


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-11-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16578
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-11-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16578
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83387/
Test PASSed.


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-11-03 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16578
  
**[Test build #83387 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83387/testReport)**
 for PR 16578 at commit 
[`71c1762`](https://github.com/apache/spark/commit/71c17622261c69503c2a2bd80c769bd664df3d9d).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-11-03 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/16578
  
I can't tell what's causing the build to fail:


https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83390/console

Any ideas?


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-11-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16578
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83390/
Test FAILed.


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-11-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16578
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-11-03 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/16578
  
> Yeah, I think having a config for this optimization is good.

I added a config switch, `spark.sql.nestedSchemaPruning.enabled`, which 
disables the optimizations if set to `false`. By default it's `true`.
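
As a sketch of how an end user might flip this switch, assuming a spark-shell session where `spark` is the active `SparkSession`, and assuming the key keeps the name given in this PR (a released Spark version may name or default it differently, so check that version's configuration docs):

```scala
// Assumes `spark` is an active SparkSession, as in spark-shell.
// The key name is the one stated in this PR comment.

// Disable nested schema pruning, e.g. to work around a suspected bug.
spark.conf.set("spark.sql.nestedSchemaPruning.enabled", "false")

// Restore the default described in this PR.
spark.conf.set("spark.sql.nestedSchemaPruning.enabled", "true")
```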


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-11-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16578
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83389/
Test FAILed.


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-11-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16578
  
Build finished. Test FAILed.


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-11-03 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16578
  
**[Test build #83387 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83387/testReport)**
 for PR 16578 at commit 
[`71c1762`](https://github.com/apache/spark/commit/71c17622261c69503c2a2bd80c769bd664df3d9d).


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-11-01 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/16578
  
Yeah, I think having a config for this optimization is good.


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-11-01 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/16578
  
Thanks @CodingCat

+1 on config switch. I think that would be a good idea.




---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-11-01 Thread CodingCat
Github user CodingCat commented on the issue:

https://github.com/apache/spark/pull/16578
  
I ran a simple test in a single-node Spark environment.

I used a synthetic dataset (that's 20M), generated as follows:
 
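```scala
import org.apache.spark.sql.SaveMode // needed for SaveMode.Overwrite below
import spark.implicits._             // enables .toDF on a collection of case classes

// Nested record: `job` becomes a struct column with two fields.
case class Job(title: String, department: String)
case class Person(id: Int, name: String, job: Job)

// Generate the synthetic rows and write them out as Parquet.
val people = (0 until 2000).map(id => Person(id, id.toString, Job(id.toString, id.toString)))
people.toDF.write.mode(SaveMode.Overwrite).parquet("/home/zhunan/parquet_test")
```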
 
I then read the directory and write one nested field to another location:
 
```scala
val df = spark.read.parquet("/home/zhunan/parquet_test")

df.select("job.title").write.mode(SaveMode.Overwrite).parquet("/home/zhunan/parquet_out")
```


Without the patch, it reads 169 MB; with the patch, it reads around 86 MB.

Basically, this shows that the PR is working.
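
A complementary check, sketched here under the assumption of a spark-shell session (`spark` in scope) and the dataset written by the snippet above, is to inspect the physical plan instead of the bytes read. File scans in the plan carry a `ReadSchema` entry. The exact plan text varies across Spark versions, so treat this as a rough guide rather than an exact expected output.

```scala
// Assumes a spark-shell session (`spark` in scope) and the Parquet data written above.
val df = spark.read.parquet("/home/zhunan/parquet_test")
val pruned = df.select("job.title")

// Print the physical plan. With nested schema pruning in effect, the Parquet scan's
// ReadSchema entry should mention only job.title rather than the whole job struct.
pruned.explain()
```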


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-10-31 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/16578
  
> I'm reluctant to generalize this PR without practical experience applying 
it to other column-oriented file formats. The only format I'm familiar with and 
have production experience with is Parquet.

I want to expand on what I wrote. A lot of this patch is already 
generalized, e.g. the catalyst code. The tricky part is porting and testing the 
file access code. While columnar formats operate under the same principles, the 
devil is in the details, so to speak. Hence my reluctance to sign off on a 
broad generalization of this patch to other file formats.

BTW, one thing that's occurred to me is the possibility of putting this 
functionality behind a configuration setting for the first one or two releases 
in which it exists. In the case of a bug we've overlooked, the end user can 
disable the optimization. What do you think?


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-10-31 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/16578
  
> @mallman I will try to go through this again. Do you think this can be 
generalized to data source v2 API?

I'm not familiar with that API.

I'm reluctant to generalize this PR without practical experience applying 
it to other column-oriented file formats. The only format I'm familiar with and 
have production experience with is Parquet.


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-10-30 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/16578
  
@mallman I will try to go through this again. Do you think this can be 
generalized to the data source v2 API?


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-10-30 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/16578
  
thanks! ping/add  @rxin @hvanhovell @gatorsmile @cloud-fan @liancheng 
@joseph-torres 


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-10-28 Thread bitcot
Github user bitcot commented on the issue:

https://github.com/apache/spark/pull/16578
  
Thanks @mallman, this is very helpful.
@felixcheung @rxin can you please help take this forward?


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-10-27 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/16578
  
@viirya I've rebased to resolve conflicts. All tests are passing. Can you 
take another look and sign off?

Cheers.


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-10-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16578
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83128/
Test PASSed.


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-10-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16578
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-10-27 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16578
  
**[Test build #83128 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83128/testReport)**
 for PR 16578 at commit 
[`7af925b`](https://github.com/apache/spark/commit/7af925bcf44861119ec305ab9631a3511d8e8bbb).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-10-27 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16578
  
**[Test build #83128 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83128/testReport)**
 for PR 16578 at commit 
[`7af925b`](https://github.com/apache/spark/commit/7af925bcf44861119ec305ab9631a3511d8e8bbb).


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-10-27 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/16578
  
@DaimonPl I'm going to resolve the merge conflicts shortly. Otherwise, I 
have no intention of making further modifications to this PR outside of further 
review.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-10-26 Thread DaimonPl
Github user DaimonPl commented on the issue:

https://github.com/apache/spark/pull/16578
  
@mallman how about finalizing it as is? IMHO the performance improvements are 
worth more than a (possibly) redundant workaround; it could be cleaned up later.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-10-13 Thread amankothari04
Github user amankothari04 commented on the issue:

https://github.com/apache/spark/pull/16578
  
@viirya did you get a chance to review this?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-10-09 Thread DaimonPl
Github user DaimonPl commented on the issue:

https://github.com/apache/spark/pull/16578
  
@mallman @viirya From my understanding, the current workaround is for the case 
of reading columns that are not in the file schema.

> Parquet-mr will throw an exception if we try to read a superset of the 
file's schema.

Isn't it somehow dependent on the schema evolution setting? 
http://spark.apache.org/docs/latest/sql-programming-guide.html#schema-merging

> Since schema merging is a relatively expensive operation, and is not a 
necessity in most cases, we turned it off by default starting from 1.5.0. You 
may enable it by
> * setting data source option mergeSchema to true when reading Parquet 
files (as shown in the examples below), or
> * setting the global SQL option spark.sql.parquet.mergeSchema to true.

Wouldn't it work fine with `spark.sql.parquet.mergeSchema` enabled?
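
For reference, the two ways of enabling schema merging quoted above look roughly like this in code. This is a sketch assuming an active `SparkSession` named `spark`; the path is a placeholder. It only shows how the setting is turned on and does not demonstrate whether it avoids the parquet-mr exception, which is the open question here.

```scala
// Assumes `spark` is an active SparkSession; "/path/to/table" is a placeholder.

// Per-read option: merge the schemas of all Parquet files under the path.
val merged = spark.read.option("mergeSchema", "true").parquet("/path/to/table")

// Session-wide alternative: enable schema merging for all Parquet reads.
spark.conf.set("spark.sql.parquet.mergeSchema", "true")
```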


---




[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-10-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16578
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82383/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


