Github user mallman commented on the issue:
https://github.com/apache/spark/pull/16578
I'm closing this PR in favor of #21320. That PR deals with simple
projection and filter queries only. I will submit subsequent PRs for
aggregation and join queries following the acceptance of
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/16578
BTW I've been and am currently traveling with a busy itinerary. I
haven't started work on this and probably won't get to work on it until
Monday at the very earliest.
> On May 5,
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/16578
Yeah. That is fine. Will try to review the relevant PRs ASAP. Please ping
me. Thanks again!
---
-
To unsubscribe, e-mail:
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/16578
> To ensure the PR and review quality, we normally avoid doing everything
in a single huge PR. It would be much better if you can cut it to a few smaller
PRs.
I'll have a go at it. Of
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/16578
To ensure the PR and review quality, we normally avoid doing everything in
a single huge PR. It would be much better if you can cut it to a few smaller
PRs. Both @cloud-fan and I think
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/16578
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89794/
Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/16578
Merged build finished. Test PASSed.
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/16578
**[Test build #89794 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89794/testReport)**
for PR 16578 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/16578
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2636/
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/16578
Merged build finished. Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/16578
**[Test build #89794 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89794/testReport)**
for PR 16578 at commit
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/16578
I only looked at the PR description, here are my 2 cents:
Currently column pruning is done with 2 steps in Spark: 1) optimizer
generates extra `Project` to prune unnecessary columns as
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/16578
I will review this huge PR. : )
---
Github user felixcheung commented on the issue:
https://github.com/apache/spark/pull/16578
hi - where are we on this?
---
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/16578
Please do the review @gengliangwang @jiangxb1987 . We should support this
feature in Spark 2.4.0
---
Github user zaycev commented on the issue:
https://github.com/apache/spark/pull/16578
I observed about 5x better performance in reading a small subset of fields
of a highly nested parquet table:
master:
Github user Gauravshah commented on the issue:
https://github.com/apache/spark/pull/16578
We have back-ported it to 2.2. In production it has, on average, cut our
run time by at least 2x.
---
Github user Gauravshah commented on the issue:
https://github.com/apache/spark/pull/16578
@marmbrus can we target it for 2.4? We need help with reviews. It has been
in a waiting state for a very long time.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/16578
Merged build finished. Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/16578
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87859/
Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/16578
**[Test build #87859 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87859/testReport)**
for PR 16578 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/16578
**[Test build #87859 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87859/testReport)**
for PR 16578 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/16578
Merged build finished. Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/16578
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1209/
Github user DaimonPl commented on the issue:
https://github.com/apache/spark/pull/16578
So if it's not going to be included in `2.3.0` maybe we could change
`spark.sql.nestedSchemaPruning.enabled` to default `true` ? I hope that this
time this PR could be finalized at the early stage
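For readers who want to try the flag discussed here, the following is a minimal sketch of turning it on. The configuration key is the one named in this thread; a released Spark version may expose it under a different name, so check the configuration documentation for the version in use. The object name, application name, and local master are illustrative assumptions, and the snippet needs Spark SQL on the classpath.

```scala
import org.apache.spark.sql.SparkSession

object NestedPruningConfig {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("nested-schema-pruning-demo") // hypothetical application name
      .master("local[*]")                    // assumption: local test run
      // Key as named in this thread; it may differ in a released Spark version.
      .config("spark.sql.nestedSchemaPruning.enabled", "true")
      .getOrCreate()

    // The flag can also be changed on an existing session.
    spark.conf.set("spark.sql.nestedSchemaPruning.enabled", "false")
    println(spark.conf.get("spark.sql.nestedSchemaPruning.enabled"))

    spark.stop()
  }
}
```

Setting the key per session makes it easy to compare query plans with the optimization on and off before changing any default.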
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/16578
I'd just suggest trying it. Since this PR is a patch for master, please
message me personally at m...@allman.ms to discuss progress and questions
on a backport to 2.2. If we get it working,
Github user Gauravshah commented on the issue:
https://github.com/apache/spark/pull/16578
@mallman do you foresee any issues? I am planning to backport it to Spark 2.2
on a personal fork, and will probably make a jitpack release.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/16578
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85662/
Test FAILed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/16578
**[Test build #85662 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85662/testReport)**
for PR 16578 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/16578
Merged build finished. Test FAILed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/16578
**[Test build #85662 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85662/testReport)**
for PR 16578 at commit
Github user VigneshMohan1 commented on the issue:
https://github.com/apache/spark/pull/16578
@JoshRosen Can we get this PR into 2.3.0? A lot of people are interested in
this, and it will boost performance when reading nested Parquet fields.
---
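The performance gain mentioned in these comments comes from reading only the nested fields a query touches. The sketch below illustrates that idea on a toy schema model. It is not Spark's `StructType` and not the code from this PR: the `FieldType`, `Leaf`, `Struct`, and `NestedPruning` names are hypothetical and exist only for this example.

```scala
// Toy schema model for illustration only (not Spark's StructType).
sealed trait FieldType
case object Leaf extends FieldType
final case class Struct(fields: Map[String, FieldType]) extends FieldType

object NestedPruning {
  /** Keep only the parts of `schema` reachable through the requested dotted paths. */
  def prune(schema: Struct, paths: Seq[String]): Struct = {
    // Group the requested paths by their first segment, keeping the remainders.
    val byHead: Map[String, Seq[List[String]]] =
      paths.map(_.split('.').toList)
        .groupBy(_.head)
        .map { case (head, ps) => head -> ps.map(_.tail) }

    val kept = schema.fields.collect {
      case (name, tpe) if byHead.contains(name) =>
        val rest = byHead(name)
        val pruned: FieldType = tpe match {
          // An empty remainder means the whole field was requested, so only
          // recurse when every request names a sub-field.
          case s: Struct if !rest.exists(_.isEmpty) =>
            prune(s, rest.map(_.mkString(".")))
          case other => other
        }
        name -> pruned
    }
    Struct(kept)
  }

  def main(args: Array[String]): Unit = {
    val schema = Struct(Map(
      "id" -> Leaf,
      "name" -> Struct(Map("first" -> Leaf, "last" -> Leaf)),
      "address" -> Struct(Map("city" -> Leaf))
    ))
    // Only `id` and `name.first` survive; `name.last` and `address` are dropped.
    println(prune(schema, Seq("id", "name.first")))
  }
}
```

With a columnar format such as Parquet, a reader handed the pruned schema can skip the column chunks for the dropped fields, which is where the I/O saving would come from.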
Github user Gauravshah commented on the issue:
https://github.com/apache/spark/pull/16578
@marmbrus can we start the review process ? so that it can make it for the
next release ?
---
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/16578
> However, I am -1 on merging a change this large after branch cut.
It's disappointing, but I agree we can't merge a change this large into a
branch cut. It will have to wait for 2.3.1 at
Github user marmbrus commented on the issue:
https://github.com/apache/spark/pull/16578
I agree that this PR needs to be allocated more review bandwidth, and it is
unfortunate that it has been blocked on that. However, I am -1 on merging a
change this large after branch cut.
---
Github user ianoc commented on the issue:
https://github.com/apache/spark/pull/16578
Given it has one or two deep reviews already, can someone just rubber
stamp this in a bias to shipping? It's been stalled more or less since July
waiting on reviewers.
---
Github user felixcheung commented on the issue:
https://github.com/apache/spark/pull/16578
We are still merging changes to the 2.3 branch :)
---
Github user Gauravshah commented on the issue:
https://github.com/apache/spark/pull/16578
@DaimonPl branch 2.3 is already cut, so it's at least not making it into 2.3 :(
---
Github user DaimonPl commented on the issue:
https://github.com/apache/spark/pull/16578
New year, new review? ;)
---
Github user Gauravshah commented on the issue:
https://github.com/apache/spark/pull/16578
Thanks @mallman for rebasing each time. @gatorsmile can you take a look at
it?
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/16578
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84820/
Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/16578
Merged build finished. Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/16578
**[Test build #84820 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84820/testReport)**
for PR 16578 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/16578
**[Test build #84820 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84820/testReport)**
for PR 16578 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/16578
Merged build finished. Test FAILed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/16578
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84812/
Test FAILed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/16578
**[Test build #84812 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84812/testReport)**
for PR 16578 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/16578
**[Test build #84812 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84812/testReport)**
for PR 16578 at commit
Github user abhaynahar commented on the issue:
https://github.com/apache/spark/pull/16578
Sorry for spamming, but @rxin @marmbrus @ericl @cloud-fan @liancheng can
you please help take this forward? @viirya has reviewed it closely and is
looking for someone else to review this as
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/16578
@abhaynahar I think the reviewers are already included...
---
Github user abhaynahar commented on the issue:
https://github.com/apache/spark/pull/16578
@viirya can you please help tag people you think should review ?
---
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/16578
As I mentioned before, we still don't have enough eyes on this change so
far.
---
Github user felixcheung commented on the issue:
https://github.com/apache/spark/pull/16578
yes!
sorry about the delay, I think there's a lot of interest in this PR.
@gatorsmile @viirya ?
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/16578
Merged build finished. Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/16578
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84404/
Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/16578
**[Test build #84404 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84404/testReport)**
for PR 16578 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/16578
**[Test build #84404 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84404/testReport)**
for PR 16578 at commit
Github user sriramrajendiran commented on the issue:
https://github.com/apache/spark/pull/16578
@felixcheung can you help? We are hoping to see it in the 2.3 release. Putting
the feature behind a flag that is disabled by default looks like a safe option.
---
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/16578
> But I think we still need other eyes on this too.
Agreed.
@rxin can you help rope anyone else in on this? It's a big PR with a bigger
history, but absent some savaging by another
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/16578
> Can you give an example it would fail? We didn't change
clipParquetSchema, so I think even when pruning happens, why we read a super
set of the file's schema and cause the exception, according to
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/16578
Merged build finished. Test PASSed.
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/16578
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84041/
Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/16578
**[Test build #84041 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84041/testReport)**
for PR 16578 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/16578
**[Test build #84041 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84041/testReport)**
for PR 16578 at commit
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/16578
I'm going through this again. But I think we still need other eyes on this too.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/16578
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83532/
Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/16578
Merged build finished. Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/16578
**[Test build #83532 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83532/testReport)**
for PR 16578 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/16578
**[Test build #83532 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83532/testReport)**
for PR 16578 at commit
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/16578
retest this please.
---
Github user Swat123 commented on the issue:
https://github.com/apache/spark/pull/16578
@viirya can we close this before we get another set of merge conflicts ?
Thanks
---
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/16578
@viirya Can you please take a look at my latest revisions and replies to
your comments? Cheers.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/16578
Merged build finished. Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/16578
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83387/
Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/16578
**[Test build #83387 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83387/testReport)**
for PR 16578 at commit
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/16578
I can't tell what's causing the build to fail:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83390/console
Any ideas?
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/16578
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83390/
Test FAILed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/16578
Merged build finished. Test FAILed.
---
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/16578
> Yeah, I think with a config for this optimization is good.
I added a config switch, `spark.sql.nestedSchemaPruning.enabled`, which
disables the optimizations if set to `false`. By default
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/16578
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83389/
Test FAILed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/16578
Build finished. Test FAILed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/16578
**[Test build #83387 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83387/testReport)**
for PR 16578 at commit
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/16578
Yeah, I think with a config for this optimization is good.
---
Github user felixcheung commented on the issue:
https://github.com/apache/spark/pull/16578
Thanks @CodingCat
+1 on config switch. I think that would be a good idea.
---
Github user CodingCat commented on the issue:
https://github.com/apache/spark/pull/16578
I made a simple test in a single-node Spark environment.
I used a synthetic dataset which is generated as: (that's 20M)
```scala
import spark.implicits._
import
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/16578
> I'm reluctant to generalize this PR without practical experience applying
it to other column-oriented file formats. The only format I'm familiar with and
have production experience with is
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/16578
> @mallman I will try to go through this again. Do you think this can be
generalized to data source v2 API?
I'm not familiar with that API.
I'm reluctant to generalize this PR
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/16578
@mallman I will try to go through this again. Do you think this can be
generalized to data source v2 API?
---
Github user felixcheung commented on the issue:
https://github.com/apache/spark/pull/16578
thanks! ping/add @rxin @hvanhovell @gatorsmile @cloud-fan @liancheng
@joseph-torres
---
Github user bitcot commented on the issue:
https://github.com/apache/spark/pull/16578
Thanks @mallman this is very helpful.
@felixcheung @rxin can you please help to take this forward ?
---
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/16578
@viirya I've rebased to resolve conflicts. All tests are passing. Can you
take another look and sign off?
Cheers.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/16578
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83128/
Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/16578
Merged build finished. Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/16578
**[Test build #83128 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83128/testReport)**
for PR 16578 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/16578
**[Test build #83128 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83128/testReport)**
for PR 16578 at commit
Github user mallman commented on the issue:
https://github.com/apache/spark/pull/16578
@DaimonPl I'm going to resolve the merge conflicts shortly. Otherwise, I
have no intention of making further modifications to this PR outside of further
review.
---
Github user DaimonPl commented on the issue:
https://github.com/apache/spark/pull/16578
@mallman how about finalizing it as is? IMHO the performance improvements are
worth more than a (possibly) redundant workaround - it could be cleaned up later
---
Github user amankothari04 commented on the issue:
https://github.com/apache/spark/pull/16578
@viirya did you get a chance to review this ?
---
Github user DaimonPl commented on the issue:
https://github.com/apache/spark/pull/16578
@mallman @viirya from my understanding the current workaround is for the case
of reading columns which are not in the file schema
> Parquet-mr will throw an exception if we try to read a superset of
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/16578
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82383/
Test PASSed.
---