Github user viirya commented on the issue:
https://github.com/apache/spark/pull/13371
@liancheng Thanks! I didn't notice that. I will rerun the benchmark. I've
re-submitted this PR at #13701.
---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well.
Github user liancheng commented on the issue:
https://github.com/apache/spark/pull/13371
@viirya One problem in your new benchmark code is that `1 << 50` is
actually very small since it's an `Int`:
```
scala> 1 << 50
res0: Int = 262144
```
Anyway, `1 <<
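The masking behavior is worth spelling out: on the JVM, an `Int` shift distance is reduced modulo 32, so `1 << 50` is evaluated as `1 << 18`; the usual fix is to widen to `Long` before shifting. A quick sketch:

```scala
// On the JVM an Int shift distance is masked to 5 bits (distance % 32),
// so 1 << 50 is evaluated as 1 << 18.
val asInt: Int = 1 << 50          // 262144, matching the REPL output above
// A Long shift distance is masked to 6 bits (distance % 64), giving 2^50.
val asLong: Long = 1L << 50       // 1125899906842624

assert(asInt == (1 << 18))
assert(asLong == 1125899906842624L)
```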
Github user yhuai commented on the issue:
https://github.com/apache/spark/pull/13371
Yea. Since this one was closed by asfgit, I am not sure you can reopen it.
On Wed, Jun 15, 2016 at 7:39 PM -0700, "Liang-Chi Hsieh" wrote:
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/13371
@yhuai ok. Do you mean I need to create a new PR for this?
Github user yhuai commented on the issue:
https://github.com/apache/spark/pull/13371
Can you add results showing that there are skipped row groups with this
change (and before this patch all row groups are loaded)?
For those results, let's also put them in the description of
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/13371
@liancheng
I reran the benchmark, excluding the time of writing the Parquet file:
```
test("Benchmark for Parquet") {
  val N = 1 << 50
  withParquetTable((0 until
```
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/13371
@liancheng Got it.
Github user liancheng commented on the issue:
https://github.com/apache/spark/pull/13371
Reverted from master and branch-2.0.
@viirya For the benchmark, there are two things:
1. The benchmark also includes the time of writing the Parquet file, so the real numbers should be much
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/13371
@rxin One thing that needs to be explained: because we have only one configuration to control filter push-down, it affects both row-based filter push-down and this row-group filter push-down.
The
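For context, the single switch referred to here is (I believe) `spark.sql.parquet.filterPushdown`; with only that one flag, there is no way to enable row-based push-down while disabling row-group pruning, or vice versa. A minimal sketch, assuming a Spark 2.x `SparkSession` named `spark`:

```scala
// This one flag gates BOTH row-based filter push-down and the
// row-group pruning discussed in this PR.
spark.conf.set("spark.sql.parquet.filterPushdown", "true")
```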
Github user rxin commented on the issue:
https://github.com/apache/spark/pull/13371
And once we have more data, it might make sense to merge this in 2.0!
Github user rxin commented on the issue:
https://github.com/apache/spark/pull/13371
To be clear, please write a proper benchmark that reads data when filter push-down is not useful, to check whether this regresses performance for the non-push-down case. Also make sure the
Github user rxin commented on the issue:
https://github.com/apache/spark/pull/13371
I just talked to @liancheng offline. I don't think we should've merged this
until we have verified there is no performance regression, and we definitely
shouldn't have merged this in 2.0.
Github user liancheng commented on the issue:
https://github.com/apache/spark/pull/13371
@yhuai We used to support row group level filter push-down before
refactoring `HadoopFsRelation` into `FileFormat`, but lost it (by accident I
guess) after the refactoring. So now we only have
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/13371
@yhuai Your step 3 may not work. We are going to filter the row groups for each Parquet file to read in `VectorizedParquetRecordReader`. I don't think we do anything when creating splits?
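For readers unfamiliar with the idea under discussion: row-group pruning uses the per-row-group min/max statistics stored in the Parquet footer to skip groups that cannot contain any matching row. The following is a toy illustration of that principle only, not Spark's or Parquet's actual code; all names here are made up:

```scala
// Toy model of row-group pruning via min/max statistics (hypothetical names).
case class RowGroupStats(min: Int, max: Int)

// For a predicate `col >= bound`, a group whose max is below the bound
// cannot contain any matching row and can be skipped entirely.
def canSkip(stats: RowGroupStats, bound: Int): Boolean = stats.max < bound

val groups = Seq(RowGroupStats(0, 99), RowGroupStats(100, 199), RowGroupStats(200, 299))
val toRead = groups.filterNot(g => canSkip(g, bound = 150))
// Only the groups covering [100, 199] and [200, 299] remain.
assert(toRead == Seq(RowGroupStats(100, 199), RowGroupStats(200, 299)))
```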
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/13371
@yhuai Parquet also does this filtering at ParquetRecordReader
Github user yhuai commented on the issue:
https://github.com/apache/spark/pull/13371
@viirya I took a look at Parquet's code. It seems Parquet only evaluates row-group-level filters when generating splits
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/13371
Merged build finished. Test PASSed.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/13371
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60256/
Test PASSed.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/13371
**[Test build #60256 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60256/consoleFull)**
for PR 13371 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/13371
**[Test build #60256 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60256/consoleFull)**
for PR 13371 at commit
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/13371
The description is updated.
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/13371
retest this please.
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/13371
It is not really a bug fix, because everything still works without this filter push-down; it is a performance fix. I will update the description.
Github user rxin commented on the issue:
https://github.com/apache/spark/pull/13371
Is this a bug fix or a performance fix? Sorry, I don't really understand after reading your description.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/13371
Merged build finished. Test FAILed.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/13371
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60246/
Test FAILed.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/13371
**[Test build #60246 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60246/consoleFull)**
for PR 13371 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/13371
**[Test build #60246 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60246/consoleFull)**
for PR 13371 at commit
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/13371
ping @yhuai @rxin @cloud-fan
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/13371
cc @cloud-fan too.
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/13371
cc @rxin Can you also take a look at this? It has been sitting for a while too.
Thanks!
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/13371
ping @yhuai again
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/13371
ping @yhuai I've addressed the comments. Please take a look again. Thanks!
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/13371
@yhuai I've run a simple benchmark as follows:
```
test("Benchmark for Parquet") {
  val N = 1 << 20
  val benchmark = new Benchmark("Parquet reader", N)
```