[GitHub] spark issue #13371: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-06-16 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/13371 @liancheng Thanks! I didn't notice that. I will rerun the benchmark. I've re-submitted this PR at #13701. --- If your project is set up for it, you can reply to this email and have your reply

[GitHub] spark issue #13371: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-06-16 Thread liancheng
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/13371 @viirya One problem in your new benchmark code is that `1 << 50` is actually very small since it's an `Int`:
```
scala> 1 << 50
res0: Int = 262144
```
Anyway, `1 <<
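
The `Int` result quoted above follows from how the JVM masks shift counts. A minimal sketch, in plain Scala with no Spark dependency, contrasts an `Int` and a `Long` left operand. The object name `ShiftWidthDemo` is hypothetical and is not part of the benchmark code discussed in this thread.

```scala
object ShiftWidthDemo {
  // Int shifts use only the low 5 bits of the shift count,
  // so a count of 50 behaves like 50 & 31 = 18.
  val asInt: Int = 1 << 50

  // Long shifts use the low 6 bits of the shift count,
  // so a Long left operand keeps the full count of 50.
  val asLong: Long = 1L << 50

  def main(args: Array[String]): Unit = {
    println(s"1 << 50  = $asInt")
    println(s"1L << 50 = $asLong")
  }
}
```

Whether a `Long` row count is practical for a benchmark is a separate question; the sketch only shows why the `Int` expression yields a much smaller value than the literal suggests.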

[GitHub] spark issue #13371: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-06-15 Thread yhuai
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/13371 Yea. Since this one was closed by asfgit, I am not sure you can reopen it.

[GitHub] spark issue #13371: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-06-15 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/13371 @yhuai ok. Do you mean I need to create a new PR for this?

[GitHub] spark issue #13371: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-06-14 Thread yhuai
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/13371 Can you add results showing that there are skipped row groups with this change (and before this patch all row groups are loaded)? For those results, let's also put them in the description of

[GitHub] spark issue #13371: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-06-14 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/13371 @liancheng I reran the benchmark, this time excluding the time spent writing the Parquet file: test("Benchmark for Parquet") { val N = 1 << 50 withParquetTable((0 until

[GitHub] spark issue #13371: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-06-10 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/13371 @liancheng Got it.

[GitHub] spark issue #13371: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-06-10 Thread liancheng
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/13371 Reverted from master and branch-2.0. @viirya For the benchmark, there are two things: 1. The benchmark also counts Parquet file writing into it, so the real number should be much
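
The first point above, that the measurement also includes Parquet file writing, comes down to keeping setup outside the timed section. A minimal sketch of that idea in plain Scala follows. `TimedSectionDemo` and `timeExcludingSetup` are hypothetical names, not Spark's benchmark utility, and the in-memory array is only a stand-in for a Parquet table.

```scala
object TimedSectionDemo {
  /** Runs `setup` untimed, then returns the elapsed nanoseconds of `body` only. */
  def timeExcludingSetup[A](setup: () => A)(body: A => Unit): Long = {
    val input = setup()            // e.g. writing the test data; not timed
    val start = System.nanoTime()
    body(input)                    // e.g. the filtered read; timed
    System.nanoTime() - start
  }

  def main(args: Array[String]): Unit = {
    // Stand-in data set: an in-memory array instead of a Parquet file.
    val elapsed = timeExcludingSetup(() => Array.range(0, 1 << 20)) { data =>
      val hits = data.count(_ < 100) // stand-in for a filtered scan
      require(hits == 100)
    }
    println(s"timed section took ${elapsed / 1e6} ms")
  }
}
```

A real comparison would also need warm-up runs and repeated iterations, which this sketch leaves out.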

[GitHub] spark issue #13371: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-06-10 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/13371 @rxin One thing that needs explaining: because we have just one configuration to control filter push-down, it affects both row-based filter push-down and this row-group filter push-down. The

[GitHub] spark issue #13371: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-06-10 Thread rxin
Github user rxin commented on the issue: https://github.com/apache/spark/pull/13371 And once we have more data, it might make sense to merge this in 2.0!

[GitHub] spark issue #13371: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-06-10 Thread rxin
Github user rxin commented on the issue: https://github.com/apache/spark/pull/13371 To be clearer, please write a proper benchmark that reads data when filter push-down is not useful, to check whether this regresses performance for the non-push-down case. Also make sure the

[GitHub] spark issue #13371: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-06-10 Thread rxin
Github user rxin commented on the issue: https://github.com/apache/spark/pull/13371 I just talked to @liancheng offline. I don't think we should've merged this until we have verified there is no performance regression, and we definitely shouldn't have merged this in 2.0.

[GitHub] spark issue #13371: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-06-10 Thread liancheng
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/13371 @yhuai We used to support row group level filter push-down before refactoring `HadoopFsRelation` into `FileFormat`, but lost it (by accident I guess) after the refactoring. So now we only have
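
For readers unfamiliar with the term, row-group level filter push-down means consulting per-row-group statistics to skip whole groups before any row is read. The sketch below illustrates the idea in plain Scala with a single `Int` column and a `value < bound` predicate. `RowGroup`, `candidateGroups`, and `scan` are hypothetical names for illustration only; they are not Spark or Parquet APIs, and the actual reader logic in this PR may differ.

```scala
object RowGroupSkipDemo {
  // Hypothetical stand-in for a row group carrying min/max statistics on one Int column.
  final case class RowGroup(min: Int, max: Int, rows: Vector[Int])

  /** Keeps only the row groups that could contain a value below `bound`. */
  def candidateGroups(groups: Seq[RowGroup], bound: Int): Seq[RowGroup] =
    groups.filter(_.min < bound)

  /** Row-level filtering still runs, but only on the surviving groups. */
  def scan(groups: Seq[RowGroup], bound: Int): Seq[Int] =
    candidateGroups(groups, bound).flatMap(_.rows.filter(_ < bound))

  // Four groups of 100 consecutive values each: 0..99, 100..199, 200..299, 300..399.
  val sampleGroups: Seq[RowGroup] = (0 until 4).map { g =>
    val rows = Vector.range(g * 100, (g + 1) * 100)
    RowGroup(rows.min, rows.max, rows)
  }

  def main(args: Array[String]): Unit = {
    println(candidateGroups(sampleGroups, 150).size) // 2 of 4 groups survive
    println(scan(sampleGroups, 150).size)            // 150 matching rows
  }
}
```

With the predicate `value < 150`, the two groups whose minimum is 200 or higher are never scanned, which is the saving the benchmark requests in this thread are trying to measure.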

[GitHub] spark issue #13371: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-06-09 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/13371 @yhuai Your step 3 may not work. We are going to filter the row groups for each parquet file to read in `VectorizedParquetRecordReader`. I think we don't do anything regarding creating splits?

[GitHub] spark issue #13371: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-06-09 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/13371 @yhuai Parquet also does this filtering at ParquetRecordReader

[GitHub] spark issue #13371: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-06-09 Thread yhuai
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/13371 @viirya I took a look at Parquet's code. It seems Parquet only evaluates row group level filters when generating splits

[GitHub] spark issue #13371: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-06-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13371 Merged build finished. Test PASSed.

[GitHub] spark issue #13371: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-06-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13371 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60256/

[GitHub] spark issue #13371: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-06-09 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13371 **[Test build #60256 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60256/consoleFull)** for PR 13371 at commit

[GitHub] spark issue #13371: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-06-09 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13371 **[Test build #60256 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60256/consoleFull)** for PR 13371 at commit

[GitHub] spark issue #13371: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-06-09 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/13371 The description is updated.

[GitHub] spark issue #13371: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-06-09 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/13371 retest this please.

[GitHub] spark issue #13371: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-06-09 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/13371 It is not really a bug fix, because queries still work without this filter push-down. This should be a performance fix. I should modify the description.

[GitHub] spark issue #13371: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-06-09 Thread rxin
Github user rxin commented on the issue: https://github.com/apache/spark/pull/13371 Is this a bug fix or performance fix? Sorry I don't really understand after reading your description.

[GitHub] spark issue #13371: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-06-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13371 Merged build finished. Test FAILed.

[GitHub] spark issue #13371: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-06-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13371 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60246/

[GitHub] spark issue #13371: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-06-09 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13371 **[Test build #60246 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60246/consoleFull)** for PR 13371 at commit

[GitHub] spark issue #13371: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-06-09 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13371 **[Test build #60246 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60246/consoleFull)** for PR 13371 at commit

[GitHub] spark issue #13371: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-06-09 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/13371 ping @yhuai @rxin @cloud-fan

[GitHub] spark issue #13371: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-06-08 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/13371 cc @cloud-fan too.

[GitHub] spark issue #13371: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-06-07 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/13371 cc @rxin Can you also take a look at this? It has been waiting for a while too. Thanks!

[GitHub] spark issue #13371: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-06-06 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/13371 ping @yhuai again

[GitHub] spark issue #13371: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-06-03 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/13371 ping @yhuai I've addressed the comments. Please take a look again. Thanks!

[GitHub] spark issue #13371: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-06-02 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/13371 @yhuai I've run a simple benchmark as follows: test("Benchmark for Parquet") { val N = 1 << 20 val benchmark = new Benchmark("Parquet reader", N)