[Impala-ASF-CR] IMPALA-2328: Read support for min/max Parquet statistics

2017-02-24 Thread Lars Volker (Code Review)
Lars Volker has posted comments on this change.

Change subject: IMPALA-2328: Read support for min/max Parquet statistics
..


Patch Set 11:

(6 comments)

Thank you Alex for the review. I submitted a change to address your comments 
here: https://gerrit.cloudera.org/#/c/6147/

Let's continue the review there.

http://gerrit.cloudera.org:8080/#/c/6032/11/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 194:   // Allocate tuple buffer to evaluate conjuncts on 
parquet::Statistics.
> Sorry to keep asking you about more JIRAs :), but you could file a JIRA to 
Sure thing, IMPALA-4986


Line 196:   if (min_max_tuple_desc) {
> != nullptr?
Done, do we do that by convention?


Line 496: int col_idx = slot_desc->col_pos() - 
scan_node_->num_partition_keys();
> I think we need to handle the name-based field solution policy here as well
Done


Line 497: DCHECK(col_idx < row_group.columns.size());
> I don't think this DCHECK is right. A Parquet file may legitimately have fe
Using the schema resolver should catch this case now, so I kept the DCHECK.


http://gerrit.cloudera.org:8080/#/c/6032/11/be/src/exec/parquet-metadata-utils.cc
File be/src/exec/parquet-metadata-utils.cc:

Line 223:   DCHECK(col_idx < row_group.columns.size());
> I don't think this DCHECK is right. A Parquet file may legitimately not hav
With the changes in the caller I think it is correct now. Within the scope of 
this method it seems reasonable to me to expect the caller to make this 
guarantee.


http://gerrit.cloudera.org:8080/#/c/6032/11/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

Line 340:   if (!(binaryPred.getChild(0) instanceof SlotRef)) continue;
> Why not allow cast SlotRefs?
Added support for implicit casts.


-- 
To view, visit http://gerrit.cloudera.org:8080/6032
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4
Gerrit-PatchSet: 11
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Matthew Mulder 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-2328: Read support for min/max Parquet statistics

2017-02-23 Thread Alex Behm (Code Review)
Alex Behm has posted comments on this change.

Change subject: IMPALA-2328: Read support for min/max Parquet statistics
..


Patch Set 11:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6032/11/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

Line 315: SlotRef slot = new SlotRef(slotDesc);
> you're printing the original predicate, not the constructed one.
I prefer to see in the explain plan what's actually being executed (and not a 
projection thereof). But works for me in this case.


-- 
To view, visit http://gerrit.cloudera.org:8080/6032
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4
Gerrit-PatchSet: 11
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Matthew Mulder 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-2328: Read support for min/max Parquet statistics

2017-02-23 Thread Marcel Kornacker (Code Review)
Marcel Kornacker has posted comments on this change.

Change subject: IMPALA-2328: Read support for min/max Parquet statistics
..


Patch Set 11:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6032/11/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

Line 315: SlotRef slot = new SlotRef(slotDesc);
> Would look nicer if the explain plan printed the predicates evaluated again
you're printing the original predicate, not the constructed one.

i prefer that, because it's much easier to relate back to the original query 
('=' would show up as two predicates otherwise), and the actual implementation 
isn't very meaningful to the user.


-- 
To view, visit http://gerrit.cloudera.org:8080/6032
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4
Gerrit-PatchSet: 11
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Matthew Mulder 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-2328: Read support for min/max Parquet statistics

2017-02-23 Thread Alex Behm (Code Review)
Alex Behm has posted comments on this change.

Change subject: IMPALA-2328: Read support for min/max Parquet statistics
..


Patch Set 11:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/6032/11/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 194:   // Allocate tuple buffer to evaluate conjuncts on 
parquet::Statistics.
Sorry to keep asking you about more JIRAs :), but you could file a JIRA to 
evaluate min/max aggregates against Parquet stats?


Line 196:   if (min_max_tuple_desc) {
!= nullptr?


Line 496: int col_idx = slot_desc->col_pos() - 
scan_node_->num_partition_keys();
I think we need to handle the name-based field solution policy here as well.

SET PARQUET_FALLBACK_SCHEMA_RESOLUTION=NAME


Line 497: DCHECK(col_idx < row_group.columns.size());
I don't think this DCHECK is right. A Parquet file may legitimately have fewer 
columns than the table schema.

In these cases we can easily skip the row group because we'd initialize the 
missing slot values to NULL.


http://gerrit.cloudera.org:8080/#/c/6032/11/be/src/exec/parquet-metadata-utils.cc
File be/src/exec/parquet-metadata-utils.cc:

Line 223:   DCHECK(col_idx < row_group.columns.size());
I don't think this DCHECK is right. A Parquet file may legitimately not have 
all columns specified in the table schema.

Also, in these cases we can easily skip the row group because we'd initialize 
the missing slot values to NULL.


http://gerrit.cloudera.org:8080/#/c/6032/11/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

Line 315: SlotRef slot = new SlotRef(slotDesc);
Would look nicer if the explain plan printed the predicates evaluated against 
the statistics with "max(col)" or "min(col)".

You can do this by giving the slotDesc a custom label with setLabel().


Line 340:   if (!(binaryPred.getChild(0) instanceof SlotRef)) continue;
Why not allow cast SlotRefs?


-- 
To view, visit http://gerrit.cloudera.org:8080/6032
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4
Gerrit-PatchSet: 11
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Matthew Mulder 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-2328: Read support for min/max Parquet statistics

2017-02-23 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change.

Change subject: IMPALA-2328: Read support for min/max Parquet statistics
..


Patch Set 10: Verified+1

-- 
To view, visit http://gerrit.cloudera.org:8080/6032
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4
Gerrit-PatchSet: 10
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Matthew Mulder 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-2328: Read support for min/max Parquet statistics

2017-02-23 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has submitted this change and it was merged.

Change subject: IMPALA-2328: Read support for min/max Parquet statistics
..


IMPALA-2328: Read support for min/max Parquet statistics

This change adds support for skipping row groups based on Parquet row
group statistics. With this change we only support reading statistics
from Parquet files for numerical types (bool, integer, floating point)
and for simple predicates of the formsor
  , where  is LT, LE, GE, GT, and EQ.

Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4
Reviewed-on: http://gerrit.cloudera.org:8080/6032
Reviewed-by: Lars Volker 
Tested-by: Impala Public Jenkins
---
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
A be/src/exec/parquet-column-stats.cc
M be/src/exec/parquet-column-stats.h
A be/src/exec/parquet-column-stats.inline.h
M be/src/exec/parquet-metadata-utils.cc
M be/src/exec/parquet-metadata-utils.h
M be/src/exprs/expr.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/AnalysisContext.java
M fe/src/main/java/org/apache/impala/analysis/BinaryPredicate.java
M fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/KuduScanNode.java
A fe/src/main/java/org/apache/impala/rewrite/NormalizeBinaryPredicatesRule.java
M fe/src/test/java/org/apache/impala/analysis/ExprRewriteRulesTest.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
M testdata/workloads/functional-planner/queries/PlannerTest/aggregation.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/constant-folding.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/data-source-tables.test
M testdata/workloads/functional-planner/queries/PlannerTest/hdfs.test
M testdata/workloads/functional-planner/queries/PlannerTest/implicit-joins.test
M testdata/workloads/functional-planner/queries/PlannerTest/kudu.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/mt-dop-validation.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/runtime-filter-propagation.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/subquery-rewrite.test
A testdata/workloads/functional-query/queries/QueryTest/parquet_stats.test
M tests/query_test/test_insert_parquet.py
32 files changed, 979 insertions(+), 162 deletions(-)

Approvals:
  Impala Public Jenkins: Verified
  Lars Volker: Looks good to me, approved



-- 
To view, visit http://gerrit.cloudera.org:8080/6032
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4
Gerrit-PatchSet: 11
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Matthew Mulder 
Gerrit-Reviewer: Mostafa Mokhtar 


[Impala-ASF-CR] IMPALA-2328: Read support for min/max Parquet statistics

2017-02-22 Thread Lars Volker (Code Review)
Lars Volker has posted comments on this change.

Change subject: IMPALA-2328: Read support for min/max Parquet statistics
..


Patch Set 10: Code-Review+2

PS10 is a rebase. Carrying Marcel's +2.

-- 
To view, visit http://gerrit.cloudera.org:8080/6032
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4
Gerrit-PatchSet: 10
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Matthew Mulder 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-2328: Read support for min/max Parquet statistics

2017-02-22 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change.

Change subject: IMPALA-2328: Read support for min/max Parquet statistics
..


Patch Set 10:

Build started: http://jenkins.impala.io:8080/job/gerrit-verify-dryrun/295/

-- 
To view, visit http://gerrit.cloudera.org:8080/6032
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4
Gerrit-PatchSet: 10
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Matthew Mulder 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-2328: Read support for min/max Parquet statistics

2017-02-22 Thread Lars Volker (Code Review)
Lars Volker has posted comments on this change.

Change subject: IMPALA-2328: Read support for min/max Parquet statistics
..


Patch Set 9:

PS9 addresses a clang-tidy error.

-- 
To view, visit http://gerrit.cloudera.org:8080/6032
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Matthew Mulder 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-2328: Read support for min/max Parquet statistics

2017-02-22 Thread Lars Volker (Code Review)
Hello Marcel Kornacker, Impala Public Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/6032

to look at the new patch set (#9).

Change subject: IMPALA-2328: Read support for min/max Parquet statistics
..

IMPALA-2328: Read support for min/max Parquet statistics

This change adds support for skipping row groups based on Parquet row
group statistics. With this change we only support reading statistics
from Parquet files for numerical types (bool, integer, floating point)
and for simple predicates of the formsor
  , where  is LT, LE, GE, GT, and EQ.

Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4
---
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
A be/src/exec/parquet-column-stats.cc
M be/src/exec/parquet-column-stats.h
A be/src/exec/parquet-column-stats.inline.h
M be/src/exec/parquet-metadata-utils.cc
M be/src/exec/parquet-metadata-utils.h
M be/src/exprs/expr.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/AnalysisContext.java
M fe/src/main/java/org/apache/impala/analysis/BinaryPredicate.java
M fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/KuduScanNode.java
A fe/src/main/java/org/apache/impala/rewrite/NormalizeBinaryPredicatesRule.java
M fe/src/test/java/org/apache/impala/analysis/ExprRewriteRulesTest.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
M testdata/workloads/functional-planner/queries/PlannerTest/aggregation.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/constant-folding.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/data-source-tables.test
M testdata/workloads/functional-planner/queries/PlannerTest/hdfs.test
M testdata/workloads/functional-planner/queries/PlannerTest/implicit-joins.test
M testdata/workloads/functional-planner/queries/PlannerTest/kudu.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/mt-dop-validation.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/runtime-filter-propagation.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/subquery-rewrite.test
A testdata/workloads/functional-query/queries/QueryTest/parquet_stats.test
M tests/query_test/test_insert_parquet.py
32 files changed, 979 insertions(+), 162 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/32/6032/9
-- 
To view, visit http://gerrit.cloudera.org:8080/6032
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Matthew Mulder 
Gerrit-Reviewer: Mostafa Mokhtar 


[Impala-ASF-CR] IMPALA-2328: Read support for min/max Parquet statistics

2017-02-22 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change.

Change subject: IMPALA-2328: Read support for min/max Parquet statistics
..


Patch Set 8: Verified-1

Build failed: http://jenkins.impala.io:8080/job/gerrit-verify-dryrun/294/

-- 
To view, visit http://gerrit.cloudera.org:8080/6032
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Matthew Mulder 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-2328: Read support for min/max Parquet statistics

2017-02-22 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change.

Change subject: IMPALA-2328: Read support for min/max Parquet statistics
..


Patch Set 8:

Build started: http://jenkins.impala.io:8080/job/gerrit-verify-dryrun/294/

-- 
To view, visit http://gerrit.cloudera.org:8080/6032
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Matthew Mulder 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-2328: Read support for min/max Parquet statistics

2017-02-22 Thread Marcel Kornacker (Code Review)
Marcel Kornacker has posted comments on this change.

Change subject: IMPALA-2328: Read support for min/max Parquet statistics
..


Patch Set 8: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/6032
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Matthew Mulder 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-2328: Read support for min/max Parquet statistics

2017-02-22 Thread Lars Volker (Code Review)
Lars Volker has posted comments on this change.

Change subject: IMPALA-2328: Read support for min/max Parquet statistics
..


Patch Set 4:

(1 comment)

Thanks for the review. I addressed the remaining comments in PS8. There's the 
question who calls ColumnStatsBase::ReadFromThrift, let me know if you want me 
to elaborate on it in the comment.

http://gerrit.cloudera.org:8080/#/c/6032/7/fe/src/test/java/org/apache/impala/planner/PlannerTest.java
File fe/src/test/java/org/apache/impala/planner/PlannerTest.java:

Line 55:   }
> i thought the stats predicates are visible in extended?
Yes, done.


-- 
To view, visit http://gerrit.cloudera.org:8080/6032
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Matthew Mulder 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-2328: Read support for min/max Parquet statistics

2017-02-22 Thread Lars Volker (Code Review)
Hello Marcel Kornacker,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/6032

to look at the new patch set (#8).

Change subject: IMPALA-2328: Read support for min/max Parquet statistics
..

IMPALA-2328: Read support for min/max Parquet statistics

This change adds support for skipping row groups based on Parquet row
group statistics. With this change we only support reading statistics
from Parquet files for numerical types (bool, integer, floating point)
and for simple predicates of the formsor
  , where  is LT, LE, GE, GT, and EQ.

Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4
---
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
A be/src/exec/parquet-column-stats.cc
M be/src/exec/parquet-column-stats.h
A be/src/exec/parquet-column-stats.inline.h
M be/src/exec/parquet-metadata-utils.cc
M be/src/exec/parquet-metadata-utils.h
M be/src/exprs/expr.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/AnalysisContext.java
M fe/src/main/java/org/apache/impala/analysis/BinaryPredicate.java
M fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/KuduScanNode.java
A fe/src/main/java/org/apache/impala/rewrite/NormalizeBinaryPredicatesRule.java
M fe/src/test/java/org/apache/impala/analysis/ExprRewriteRulesTest.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
M testdata/workloads/functional-planner/queries/PlannerTest/aggregation.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/constant-folding.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/data-source-tables.test
M testdata/workloads/functional-planner/queries/PlannerTest/hdfs.test
M testdata/workloads/functional-planner/queries/PlannerTest/implicit-joins.test
M testdata/workloads/functional-planner/queries/PlannerTest/kudu.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/mt-dop-validation.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/runtime-filter-propagation.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/subquery-rewrite.test
A testdata/workloads/functional-query/queries/QueryTest/parquet_stats.test
M tests/query_test/test_insert_parquet.py
32 files changed, 986 insertions(+), 162 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/32/6032/8
-- 
To view, visit http://gerrit.cloudera.org:8080/6032
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Matthew Mulder 
Gerrit-Reviewer: Mostafa Mokhtar 


[Impala-ASF-CR] IMPALA-2328: Read support for min/max Parquet statistics

2017-02-22 Thread Lars Volker (Code Review)
Lars Volker has posted comments on this change.

Change subject: IMPALA-2328: Read support for min/max Parquet statistics
..


Patch Set 7:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/6032/6/be/src/exec/parquet-column-stats.h
File be/src/exec/parquet-column-stats.h:

Line 31: /// writing parquet files. It provides an interface to populate a 
parquet::Statistics
> i think you can leave the getters out for now, since we won't be calling th
Done


http://gerrit.cloudera.org:8080/#/c/6032/7/be/src/exec/parquet-column-stats.h
File be/src/exec/parquet-column-stats.h:

Line 68:   static bool ReadFromThrift(const parquet::Statistics& thrift_stats,
> who calls this?
This is called by the parquet scanners. It's implemented by dispatching on 
'col_type' and calling ReadFromThrift() in the subclass for the correct C++ 
type T. Should this be explained in the comment?


-- 
To view, visit http://gerrit.cloudera.org:8080/6032
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Matthew Mulder 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-2328: Read support for min/max Parquet statistics

2017-02-22 Thread Marcel Kornacker (Code Review)
Marcel Kornacker has posted comments on this change.

Change subject: IMPALA-2328: Read support for min/max Parquet statistics
..


Patch Set 7: Code-Review+2

(3 comments)

http://gerrit.cloudera.org:8080/#/c/6032/6/be/src/exec/parquet-column-stats.h
File be/src/exec/parquet-column-stats.h:

Line 31: /// writing parquet files. It provides an interface to populate a 
parquet::Statistics
> The other two values are i64 null_count and i64 distinct_count, for which I
i think you can leave the getters out for now, since we won't be calling them, 
but for the sake of completeness we should add the other two enums.


http://gerrit.cloudera.org:8080/#/c/6032/7/be/src/exec/parquet-column-stats.h
File be/src/exec/parquet-column-stats.h:

Line 68:   static bool ReadFromThrift(const parquet::Statistics& thrift_stats,
who calls this?


http://gerrit.cloudera.org:8080/#/c/6032/7/fe/src/test/java/org/apache/impala/planner/PlannerTest.java
File fe/src/test/java/org/apache/impala/planner/PlannerTest.java:

Line 55: options.setExplain_level(TExplainLevel.VERBOSE);
i thought the stats predicates are visible in extended?


-- 
To view, visit http://gerrit.cloudera.org:8080/6032
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Matthew Mulder 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-2328: Read support for min/max Parquet statistics

2017-02-22 Thread Lars Volker (Code Review)
Lars Volker has posted comments on this change.

Change subject: IMPALA-2328: Read support for min/max Parquet statistics
..


Patch Set 6:

(14 comments)

Thanks for the review. Please see PS7.

http://gerrit.cloudera.org:8080/#/c/6032/6/be/src/exec/hdfs-parquet-scanner.h
File be/src/exec/hdfs-parquet-scanner.h:

Line 382:   /// Min/max statistics contexts, owned by instances of this class.
> they are not owned by the class instances themselves, they're allocated in 
Done


Line 386:   /// 'min_max_conjunct_ctxs_', which are used when evaluating 
parquet::Statistics.
> somewhat incomprehensible comment.
Done


Line 476:   Status EvaluateRowGroupStats(const parquet::RowGroup& row_group, 
bool* skip_row_group);
> you're not evaluating stats, you're evaluating predicates on the stats
Done


http://gerrit.cloudera.org:8080/#/c/6032/6/be/src/exec/parquet-column-stats.h
File be/src/exec/parquet-column-stats.h:

Line 31: enum class StatsField { MIN, MAX };
> move into ColumnStatsBase so it's clear what stats this is referring to.
The other two values are i64 null_count and i64 distinct_count, for which I 
think we should have proper getters, i.e. we don't need to consider their type 
and method of decoding when reading them. Should we prefer an enum-based getter 
for those over ColumnStatsBase::ReadNullCountFromThrift() and 
ReadDistinctCountFromThrift()?


Line 114:   static bool ReadFromThrift(const parquet::Statistics& thrift_stats, 
const StatsField& stats_field,
> move function bodies into parquet-column-stats-inl.h
Done


http://gerrit.cloudera.org:8080/#/c/6032/4/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

Line 297: collectionConjuncts_.clear();
> true, but those should be identical, given that we rewrite constants as lit
Constant rewriting is controlled by the enable_expr_rewrites query option, 
changing the code to filter on literals and turning the query option off 
disables statistics for predicates like "a < 3 - 3". There is a test for this 
in parquet_stats.test:L206 that would fail. Let me know if you prefer to switch 
to literals.


http://gerrit.cloudera.org:8080/#/c/6032/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

Line 85:  * conjuncts, that are passed to the backend and will be evaluated 
against the
> "conjuncts that"
Done


Line 154:   private List minMaxPlanConjuncts_ = Lists.newArrayList();
> "plan conjuncts" doesn't really mean anything. original conjuncts?
Done


Line 302:* Builds a predicate to evaluate parquet::Statistics by copying 
'inputSlot' into
> "to evaluate against"
Done


Line 347:   // Only constant exprs can be evaluated against the 
parquet::Statistics. This
> drop "the"
Done


Line 349:   if (!constExpr.isConstant()) continue;
> as pointed out before, you shouldn't see non-literal constants here
See my reply in PS4, constant folding depends on the enable_expr_rewrites query 
option.


http://gerrit.cloudera.org:8080/#/c/6032/4/testdata/workloads/functional-planner/queries/PlannerTest/aggregation.test
File testdata/workloads/functional-planner/queries/PlannerTest/aggregation.test:

Line 1047:runtime filters: RF000 -> c_custkey
> extended is fine.
Done


http://gerrit.cloudera.org:8080/#/c/6032/6/testdata/workloads/functional-planner/queries/PlannerTest/constant-folding.test
File 
testdata/workloads/functional-planner/queries/PlannerTest/constant-folding.test:

Line 1: # Test HDFS scan node.
> run this file in extended mode so you see the stats predicates.
Done


http://gerrit.cloudera.org:8080/#/c/6032/6/testdata/workloads/functional-query/queries/QueryTest/parquet_stats.test
File testdata/workloads/functional-query/queries/QueryTest/parquet_stats.test:

Line 206: # Test that without expr rewrite and thus without constant folding, 
constant exprs still
> what's the point of those tests?
See my comment regarding literal vs. constant exprs in PS4. These tests make 
sure that we don't break support for constant exprs.


-- 
To view, visit http://gerrit.cloudera.org:8080/6032
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Matthew Mulder 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-2328: Read support for min/max Parquet statistics

2017-02-22 Thread Lars Volker (Code Review)
Lars Volker has uploaded a new patch set (#7).

Change subject: IMPALA-2328: Read support for min/max Parquet statistics
..

IMPALA-2328: Read support for min/max Parquet statistics

This change adds support for skipping row groups based on Parquet row
group statistics. With this change we only support reading statistics
from Parquet files for numerical types (bool, integer, floating point)
and for simple predicates of the formsor
  , where  is LT, LE, GE, GT, and EQ.

Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4
---
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
A be/src/exec/parquet-column-stats.cc
M be/src/exec/parquet-column-stats.h
A be/src/exec/parquet-column-stats.inline.h
M be/src/exec/parquet-metadata-utils.cc
M be/src/exec/parquet-metadata-utils.h
M be/src/exprs/expr.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/AnalysisContext.java
M fe/src/main/java/org/apache/impala/analysis/BinaryPredicate.java
M fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/KuduScanNode.java
A fe/src/main/java/org/apache/impala/rewrite/NormalizeBinaryPredicatesRule.java
M fe/src/test/java/org/apache/impala/analysis/ExprRewriteRulesTest.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
M testdata/workloads/functional-planner/queries/PlannerTest/aggregation.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/constant-folding.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/data-source-tables.test
M testdata/workloads/functional-planner/queries/PlannerTest/hdfs.test
M testdata/workloads/functional-planner/queries/PlannerTest/implicit-joins.test
M testdata/workloads/functional-planner/queries/PlannerTest/kudu.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/mt-dop-validation.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/runtime-filter-propagation.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/subquery-rewrite.test
A testdata/workloads/functional-query/queries/QueryTest/parquet_stats.test
M tests/query_test/test_insert_parquet.py
32 files changed, 1,121 insertions(+), 283 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/32/6032/7
-- 
To view, visit http://gerrit.cloudera.org:8080/6032
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Matthew Mulder 
Gerrit-Reviewer: Mostafa Mokhtar 


[Impala-ASF-CR] IMPALA-2328: Read support for min/max Parquet statistics

2017-02-21 Thread Marcel Kornacker (Code Review)
Marcel Kornacker has posted comments on this change.

Change subject: IMPALA-2328: Read support for min/max Parquet statistics
..


Patch Set 6:

(14 comments)

http://gerrit.cloudera.org:8080/#/c/6032/6/be/src/exec/hdfs-parquet-scanner.h
File be/src/exec/hdfs-parquet-scanner.h:

Line 382:   /// Min/max statistics contexts, owned by instances of this class.
they are not owned by the class instances themselves, they're allocated in 
state_->obj_pool()


Line 386:   /// 'min_max_conjunct_ctxs_', which are used when evaluating 
parquet::Statistics.
somewhat incomprehensible comment.

this is used only for a specific function, and then only to avoid the dynamic 
allocation overhead. point this out, right now it seems very mysterious.


Line 476:   Status EvaluateRowGroupStats(const parquet::RowGroup& row_group, 
bool* skip_row_group);
you're not evaluating stats, you're evaluating predicates on the stats


http://gerrit.cloudera.org:8080/#/c/6032/6/be/src/exec/parquet-column-stats.h
File be/src/exec/parquet-column-stats.h:

Line 31: enum class StatsField { MIN, MAX };
move into ColumnStatsBase so it's clear what stats this is referring to.

also, add enums for all fields in Statistics.


Line 114:   static bool ReadFromThrift(const parquet::Statistics& thrift_stats, 
const StatsField& stats_field,
move function bodies into parquet-column-stats-inl.h


http://gerrit.cloudera.org:8080/#/c/6032/4/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

Line 297: collectionConjuncts_.clear();
> My reasoning was that any constant expression could be evaluated against th
true, but those should be identical, given that we rewrite constants as 
literals.


http://gerrit.cloudera.org:8080/#/c/6032/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

Line 85:  * conjuncts, that are passed to the backend and will be evaluated 
against the
"conjuncts that"


Line 154:   private List minMaxPlanConjuncts_ = Lists.newArrayList();
"plan conjuncts" doesn't really mean anything. original conjuncts?


Line 302:* Builds a predicate to evaluate parquet::Statistics by copying 
'inputSlot' into
"to evaluate against"


Line 347:   // Only constant exprs can be evaluated against the 
parquet::Statistics. This
drop "the"


Line 349:   if (!constExpr.isConstant()) continue;
as pointed out before, you shouldn't see non-literal constants here


http://gerrit.cloudera.org:8080/#/c/6032/4/testdata/workloads/functional-planner/queries/PlannerTest/aggregation.test
File testdata/workloads/functional-planner/queries/PlannerTest/aggregation.test:

Line 1047:runtime filters: RF000 -> c_custkey
> I changed the code to track the original predicates and display them instea
extended is fine.

yes, "parquet statistics" is more concrete.


http://gerrit.cloudera.org:8080/#/c/6032/6/testdata/workloads/functional-planner/queries/PlannerTest/constant-folding.test
File 
testdata/workloads/functional-planner/queries/PlannerTest/constant-folding.test:

Line 1: # Test HDFS scan node.
run this file in extended mode so you see the stats predicates.


http://gerrit.cloudera.org:8080/#/c/6032/6/testdata/workloads/functional-query/queries/QueryTest/parquet_stats.test
File testdata/workloads/functional-query/queries/QueryTest/parquet_stats.test:

Line 206: # Test that without expr rewrite and thus without constant folding, 
constant exprs still
what's the point of those tests?


-- 
To view, visit http://gerrit.cloudera.org:8080/6032
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Matthew Mulder 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-2328: Read support for min/max Parquet statistics

2017-02-21 Thread Lars Volker (Code Review)
Lars Volker has posted comments on this change.

Change subject: IMPALA-2328: Read support for min/max Parquet statistics
..


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6032/4/testdata/workloads/functional-query/queries/QueryTest/parquet_stats.test
File testdata/workloads/functional-query/queries/QueryTest/parquet_stats.test:

Line 2:  QUERY
> Can you add coverage for filtering on the root level for nested data?
Can you elaborate on how to do this? Wouldn't it only make sense to apply 
filtering on the leave level, since objects at the root level would be complex 
ones? Can you post a sample query and the result you would expect? Thanks.


-- 
To view, visit http://gerrit.cloudera.org:8080/6032
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Matthew Mulder 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-2328: Read support for min/max Parquet statistics

2017-02-21 Thread Lars Volker (Code Review)
Lars Volker has uploaded a new patch set (#6).

Change subject: IMPALA-2328: Read support for min/max Parquet statistics
..

IMPALA-2328: Read support for min/max Parquet statistics

This change adds support for skipping row groups based on Parquet row
group statistics. With this change we only support reading statistics
from Parquet files for numerical types (bool, integer, floating point)
and for simple predicates of the formsor
  , where  is LT, LE, GE, GT, and EQ.

Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4
---
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
A be/src/exec/parquet-column-stats.cc
M be/src/exec/parquet-column-stats.h
M be/src/exec/parquet-metadata-utils.cc
M be/src/exec/parquet-metadata-utils.h
M be/src/exprs/expr.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/AnalysisContext.java
M fe/src/main/java/org/apache/impala/analysis/BinaryPredicate.java
M fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/KuduScanNode.java
A fe/src/main/java/org/apache/impala/rewrite/NormalizeBinaryPredicatesRule.java
M fe/src/test/java/org/apache/impala/analysis/ExprRewriteRulesTest.java
M testdata/workloads/functional-planner/queries/PlannerTest/aggregation.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/constant-folding.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/data-source-tables.test
M testdata/workloads/functional-planner/queries/PlannerTest/hdfs.test
M testdata/workloads/functional-planner/queries/PlannerTest/implicit-joins.test
M testdata/workloads/functional-planner/queries/PlannerTest/kudu.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/mt-dop-validation.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/runtime-filter-propagation.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/subquery-rewrite.test
A testdata/workloads/functional-query/queries/QueryTest/parquet_stats.test
M tests/query_test/test_insert_parquet.py
29 files changed, 756 insertions(+), 87 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/32/6032/6
-- 
To view, visit http://gerrit.cloudera.org:8080/6032
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Matthew Mulder 
Gerrit-Reviewer: Mostafa Mokhtar 


[Impala-ASF-CR] IMPALA-2328: Read support for min/max Parquet statistics

2017-02-21 Thread Lars Volker (Code Review)
Lars Volker has posted comments on this change.

Change subject: IMPALA-2328: Read support for min/max Parquet statistics
..


Patch Set 4:

(26 comments)

Thanks for the review. Please see PS6.

http://gerrit.cloudera.org:8080/#/c/6032/4/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 472:   ScopedBuffer tuple_buffer(scan_node_->mem_tracker());
> no need to allocate this and go through the setup for every single row grou
Done


Line 486:   conjuncts_with_stats.reserve(min_max_conjuncts_ctxs_.size());
> there's too much memory allocation going on for what are essentially static
Done


Line 496: unique_ptr stats(
> there's no need for memory allocation here
Done


Line 501: if (e->function_name() == "lt" || e->function_name() == "le") {
> introduce named constants (in the appropriate class)
Other code that compares function_name follows the same pattern. As discussed 
in person, let's keep this for now and see if we find a nicer abstraction to 
determine when to use min/max values, e.g. monotonically bound predicates.


http://gerrit.cloudera.org:8080/#/c/6032/4/be/src/exec/hdfs-parquet-scanner.h
File be/src/exec/hdfs-parquet-scanner.h:

Line 378:   /// Min/max statistics contexts.
> who owns them?
Done


http://gerrit.cloudera.org:8080/#/c/6032/4/be/src/exec/hdfs-scan-node-base.cc
File be/src/exec/hdfs-scan-node-base.cc:

Line 182:   if (min_max_tuple_id_ > -1) {
> ">" or < don't make sense for tuple ids
Done


http://gerrit.cloudera.org:8080/#/c/6032/4/be/src/exec/parquet-column-stats.h
File be/src/exec/parquet-column-stats.h:

Line 40: /// writing parquet files. It provides an interface to populate a 
parquet::Statistics
> this description isn't true anymore, it's now also an intermediate staging 
I removed the expanded abstraction and added a sentence to the comment 
explaining that it can also be used to decode stats.


Line 70:   /// 'nullptr' for unsupported types. The caller is responsible for 
freeing the memory.
> this should be created in an objectpool
Code has been removed.


Line 87:   virtual void WriteMinToBuffer(void* buffer) const = 0;
> remove "tobuffer", that's clear from the parameters. what's unclear is the 
Done


Line 126: max_value_(max_value) {
> formatting
Done


http://gerrit.cloudera.org:8080/#/c/6032/4/common/thrift/PlanNodes.thrift
File common/thrift/PlanNodes.thrift:

Line 208:   // Conjuncts that can be evaluated against the statistics of 
parquet files. Each
> refer to thrift structs explicitly, where appropriate.
Done


Line 209:   // conjunct references the slot in 'min_max_tuple_id' with the same 
index.
> i didn't understand the last sentence.
Changed it.


http://gerrit.cloudera.org:8080/#/c/6032/4/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

Line 143:   // List of conjuncts for min/max values, that are used to skip data 
when scanning
> "min/max values" is vague, refer to thrift structs explicitly.
Done


Line 145:   private List minMaxConjunts_ = Lists.newArrayList();
> conjuncts
Done


Line 147:   // Tuple that is used to materialize statistics when scanning 
Parquet files.
> describe what's in it.
Done


Line 290:* Builds a predicate to evaluate statistical values by combining 
'inputSlot',
> make 'statistical values' concrete
Done


Line 291:* 'inputPred' and 'op' into a new predicate, and adding it to 
'minMaxConjunts_'.
> missing other side effects
Done


Line 297: Preconditions.checkState(constExpr.isConstant());
> why not a literal?
My reasoning was that any constant expression could be evaluated against the 
statistics. I added a test to make sure this does indeed work as expected.


Line 303: // Make a new slot from the slot descriptor.
> what do you mean by 'slot'? (slot/slot descriptor are usually synonyms)
I meant slotRef, but the comment looked superfluous and I removed it altogether.


Line 315:* TODO: This method currently doesn't de-dup slots that are 
referenced multiple times.
> resolve
As discussed, lets keep this for now and address it in a subsequent change by 
simplifying redundant exprs. I created IMPALA-4958 to track this.


Line 328:   // We only support slot refs on the left hand side of the 
predicate, a rewriting
> this overlaps with what kuduscannode does, and there's already a similar co
I had talked to Alex about this last week and he suggested to add a general 
rewrite rule in the FE to normalize binary exprs. I removed the branches from 
BinaryPredicate.java that are no longer needed after the normalization. I also 
cleaned up code in KuduScanNode.java that is no longer needed.


Line 343:   {
> { on preceding line
Done


Line 658:   
msg.hdfs_scan_node.setMin_max_tuple_id(minMaxTuple_.getId().asInt());
> check that it's not invalid
How can I check this? The tupleId needs to come from the TupleId generator a

[Impala-ASF-CR] IMPALA-2328: Read support for min/max Parquet statistics

2017-02-21 Thread Lars Volker (Code Review)
Lars Volker has uploaded a new patch set (#5).

Change subject: IMPALA-2328: Read support for min/max Parquet statistics
..

IMPALA-2328: Read support for min/max Parquet statistics

This change adds support for skipping row groups based on Parquet row
group statistics. With this change we only support reading statistics
from Parquet files for numerical types (bool, integer, floating point)
and for simple predicates of the formsor
  , where  is LT, LE, GE, GT, and EQ.

Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4
---
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
A be/src/exec/parquet-column-stats.cc
M be/src/exec/parquet-column-stats.h
M be/src/exec/parquet-metadata-utils.cc
M be/src/exec/parquet-metadata-utils.h
M be/src/exprs/expr.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/AnalysisContext.java
M fe/src/main/java/org/apache/impala/analysis/BinaryPredicate.java
M fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/KuduScanNode.java
A fe/src/main/java/org/apache/impala/rewrite/NormalizeBinaryPredicatesRule.java
M fe/src/test/java/org/apache/impala/analysis/ExprRewriteRulesTest.java
M testdata/workloads/functional-planner/queries/PlannerTest/aggregation.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/constant-folding.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/data-source-tables.test
M testdata/workloads/functional-planner/queries/PlannerTest/hdfs.test
M testdata/workloads/functional-planner/queries/PlannerTest/implicit-joins.test
M testdata/workloads/functional-planner/queries/PlannerTest/kudu.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/mt-dop-validation.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/runtime-filter-propagation.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/subquery-rewrite.test
A testdata/workloads/functional-query/queries/QueryTest/parquet_stats.test
M tests/query_test/test_insert_parquet.py
29 files changed, 757 insertions(+), 87 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/32/6032/5
-- 
To view, visit http://gerrit.cloudera.org:8080/6032
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Matthew Mulder 
Gerrit-Reviewer: Mostafa Mokhtar 


[Impala-ASF-CR] IMPALA-2328: Read support for min/max Parquet statistics

2017-02-21 Thread Mostafa Mokhtar (Code Review)
Mostafa Mokhtar has posted comments on this change.

Change subject: IMPALA-2328: Read support for min/max Parquet statistics
..


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6032/4/testdata/workloads/functional-query/queries/QueryTest/parquet_stats.test
File testdata/workloads/functional-query/queries/QueryTest/parquet_stats.test:

Line 2:  QUERY
Can you add coverage for filtering on the root level for nested data?


-- 
To view, visit http://gerrit.cloudera.org:8080/6032
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Matthew Mulder 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-2328: Read support for min/max Parquet statistics

2017-02-19 Thread Marcel Kornacker (Code Review)
Marcel Kornacker has posted comments on this change.

Change subject: IMPALA-2328: Read support for min/max Parquet statistics
..


Patch Set 4:

(26 comments)

http://gerrit.cloudera.org:8080/#/c/6032/4/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 472:   ScopedBuffer tuple_buffer(scan_node_->mem_tracker());
no need to allocate this and go through the setup for every single row group.


Line 486:   conjuncts_with_stats.reserve(min_max_conjuncts_ctxs_.size());
there's too much memory allocation going on for what are essentially static 
structures.


Line 496: unique_ptr stats(
there's no need for memory allocation here


Line 501: if (e->function_name() == "lt" || e->function_name() == "le") {
introduce named constants (in the appropriate class)


http://gerrit.cloudera.org:8080/#/c/6032/4/be/src/exec/hdfs-parquet-scanner.h
File be/src/exec/hdfs-parquet-scanner.h:

Line 378:   /// Min/max statistics contexts.
who owns them?


http://gerrit.cloudera.org:8080/#/c/6032/4/be/src/exec/hdfs-scan-node-base.cc
File be/src/exec/hdfs-scan-node-base.cc:

Line 182:   if (min_max_tuple_id_ > -1) {
">" or < don't make sense for tuple ids


http://gerrit.cloudera.org:8080/#/c/6032/4/be/src/exec/parquet-column-stats.h
File be/src/exec/parquet-column-stats.h:

Line 40: /// writing parquet files. It provides an interface to populate a 
parquet::Statistics
this description isn't true anymore, it's now also an intermediate staging area 
for the min/max values.

i'm not sure this expanded abstraction makes sense (why do you need an 
intermediate staging area?).


Line 70:   /// 'nullptr' for unsupported types. The caller is responsible for 
freeing the memory.
this should be created in an objectpool


Line 87:   virtual void WriteMinToBuffer(void* buffer) const = 0;
remove "tobuffer", that's clear from the parameters. what's unclear is the 
format of the written data. is the void* simply a slot? 'buffer' is too generic.


Line 126: max_value_(max_value) {
formatting

it looks like you're populating an object of this class and then write out the 
min/max to the corresponding slots. why not go directly from 
parquet::statistics to the slots?


http://gerrit.cloudera.org:8080/#/c/6032/4/common/thrift/PlanNodes.thrift
File common/thrift/PlanNodes.thrift:

Line 208:   // Conjuncts that can be evaluated against the statistics of 
parquet files. Each
refer to thrift structs explicitly, where appropriate.


Line 209:   // conjunct references the slot in 'min_max_tuple_id' with the same 
index.
i didn't understand the last sentence.


http://gerrit.cloudera.org:8080/#/c/6032/4/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

Line 143:   // List of conjuncts for min/max values, that are used to skip data 
when scanning
"min/max values" is vague, refer to thrift structs explicitly.


Line 145:   private List minMaxConjunts_ = Lists.newArrayList();
conjuncts


Line 147:   // Tuple that is used to materialize statistics when scanning 
Parquet files.
describe what's in it.

i should have an understanding of the overall approach just based on reading 
the comments for the related data structures. this might best fit into the 
class header.


Line 290:* Builds a predicate to evaluate statistical values by combining 
'inputSlot',
make 'statistical values' concrete


Line 291:* 'inputPred' and 'op' into a new predicate, and adding it to 
'minMaxConjunts_'.
missing other side effects


Line 297: Preconditions.checkState(constExpr.isConstant());
why not a literal?


Line 303: // Make a new slot from the slot descriptor.
what do you mean by 'slot'? (slot/slot descriptor are usually synonyms)


Line 315:* TODO: This method currently doesn't de-dup slots that are 
referenced multiple times.
resolve


Line 328:   // We only support slot refs on the left hand side of the 
predicate, a rewriting
this overlaps with what kuduscannode does, and there's already a similar 
concept in the form of binarypredicate.getboundslot. figure out the similarity 
(might require a new concept) and move the functionality into binarypredicate.


Line 343:   {
{ on preceding line


Line 658:   
msg.hdfs_scan_node.setMin_max_tuple_id(minMaxTuple_.getId().asInt());
check that it's not invalid


http://gerrit.cloudera.org:8080/#/c/6032/4/fe/src/main/java/org/apache/impala/rewrite/NormalizeBinaryPredicatesRule.java
File 
fe/src/main/java/org/apache/impala/rewrite/NormalizeBinaryPredicatesRule.java:

Line 35: public class NormalizeBinaryPredicatesRule implements ExprRewriteRule {
add tests for rule


Line 40: if (!expr.isAnalyzed()) return expr;
why do you need this?


http://gerrit.cloudera.org:8080/#/c/6032/4/testdata/workloads/functional-planner/queries/PlannerTest/aggregation.test
File testdata/workloads/functional-planner/queries/PlannerTest/a

[Impala-ASF-CR] IMPALA-2328: Read support for min/max Parquet statistics

2017-02-17 Thread Lars Volker (Code Review)
Lars Volker has posted comments on this change.

Change subject: IMPALA-2328: Read support for min/max Parquet statistics
..


Patch Set 4:

PS4 adds the "statistics predicates" to the PlannerTest and removes the custom 
predicate normalization from KuduScanNode.java, after it is now done during 
query rewrite.

-- 
To view, visit http://gerrit.cloudera.org:8080/6032
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Lars Volker 
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-2328: Read support for min/max Parquet statistics

2017-02-17 Thread Lars Volker (Code Review)
Lars Volker has uploaded a new patch set (#4).

Change subject: IMPALA-2328: Read support for min/max Parquet statistics
..

IMPALA-2328: Read support for min/max Parquet statistics

This change adds support for skipping row groups based on Parquet row
group statistics. With this change we only support reading statistics
from Parquet files for numerical types (bool, integer, floating point)
and for simple predicates of the formsor
  , where  is LT, LE, GE, GT, and EQ.

Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4
---
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
A be/src/exec/parquet-column-stats.cc
M be/src/exec/parquet-column-stats.h
M be/src/exec/parquet-metadata-utils.cc
M be/src/exec/parquet-metadata-utils.h
M be/src/exprs/expr.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/AnalysisContext.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/KuduScanNode.java
A fe/src/main/java/org/apache/impala/rewrite/NormalizeBinaryPredicatesRule.java
M testdata/workloads/functional-planner/queries/PlannerTest/aggregation.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/constant-folding.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/data-source-tables.test
M testdata/workloads/functional-planner/queries/PlannerTest/empty.test
M testdata/workloads/functional-planner/queries/PlannerTest/kudu.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/mt-dop-validation.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/nested-collections.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/predicate-propagation.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/runtime-filter-propagation.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpch-nested.test
A testdata/workloads/functional-query/queries/QueryTest/parquet_stats.test
M tests/query_test/test_insert_parquet.py
27 files changed, 729 insertions(+), 57 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/32/6032/4
-- 
To view, visit http://gerrit.cloudera.org:8080/6032
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Lars Volker 


[Impala-ASF-CR] IMPALA-2328: Read support for min/max Parquet statistics

2017-02-16 Thread Lars Volker (Code Review)
Lars Volker has posted comments on this change.

Change subject: IMPALA-2328: Read support for min/max Parquet statistics
..


Patch Set 3:

PS3 adds the statistics predicates to EXPLAIN output and adds tests for that.

-- 
To view, visit http://gerrit.cloudera.org:8080/6032
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Lars Volker 
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-2328: Read support for min/max Parquet statistics

2017-02-16 Thread Lars Volker (Code Review)
Lars Volker has uploaded a new patch set (#3).

Change subject: IMPALA-2328: Read support for min/max Parquet statistics
..

IMPALA-2328: Read support for min/max Parquet statistics

This change adds support for skipping row groups based on Parquet row
group statistics. With this change we only support reading statistics
from Parquet files for numerical types (bool, integer, floating point)
and for simple predicates of the formsor
  , where  is LT, LE, GE, GT, and EQ.

Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4
---
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
A be/src/exec/parquet-column-stats.cc
M be/src/exec/parquet-column-stats.h
M be/src/exec/parquet-metadata-utils.cc
M be/src/exec/parquet-metadata-utils.h
M be/src/exprs/expr.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/rewrite/NormalizeExprsRule.java
A testdata/workloads/functional-query/queries/QueryTest/parquet_stats.test
M tests/query_test/test_insert_parquet.py
15 files changed, 615 insertions(+), 7 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/32/6032/3
-- 
To view, visit http://gerrit.cloudera.org:8080/6032
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Lars Volker 


[Impala-ASF-CR] IMPALA-2328: Read support for min/max Parquet statistics

2017-02-16 Thread Lars Volker (Code Review)
Lars Volker has posted comments on this change.

Change subject: IMPALA-2328: Read support for min/max Parquet statistics
..


Patch Set 2:

PS2 adds error handling when deserializing statistics

-- 
To view, visit http://gerrit.cloudera.org:8080/6032
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Lars Volker 
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-2328: Read support for min/max Parquet statistics

2017-02-16 Thread Lars Volker (Code Review)
Lars Volker has uploaded a new patch set (#2).

Change subject: IMPALA-2328: Read support for min/max Parquet statistics
..

IMPALA-2328: Read support for min/max Parquet statistics

This change adds support for skipping row groups based on Parquet row
group statistics. With this change we only support reading statistics
from Parquet files for numerical types (bool, integer, floating point)
and for simple predicates of the formsor
  , where  is LT, LE, GE, GT, and EQ.

Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4
---
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
A be/src/exec/parquet-column-stats.cc
M be/src/exec/parquet-column-stats.h
M be/src/exec/parquet-metadata-utils.cc
M be/src/exec/parquet-metadata-utils.h
M be/src/exprs/expr.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/rewrite/NormalizeExprsRule.java
A testdata/workloads/functional-query/queries/QueryTest/parquet_stats.test
M tests/query_test/test_insert_parquet.py
15 files changed, 599 insertions(+), 7 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/32/6032/2
-- 
To view, visit http://gerrit.cloudera.org:8080/6032
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 


[Impala-ASF-CR] IMPALA-2328: Read support for min/max Parquet statistics

2017-02-15 Thread Lars Volker (Code Review)
Lars Volker has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/6032

Change subject: IMPALA-2328: Read support for min/max Parquet statistics
..

IMPALA-2328: Read support for min/max Parquet statistics

This change adds support for skipping row groups based on Parquet row
group statistics. With this change we only support reading statistics
from Parquet files for numerical types (bool, integer, floating point)
and for simple predicates of the formsor
  , where  is LT, LE, GE, GT, and EQ.

Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4
---
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
A be/src/exec/parquet-column-stats.cc
M be/src/exec/parquet-column-stats.h
M be/src/exec/parquet-metadata-utils.cc
M be/src/exec/parquet-metadata-utils.h
M be/src/exprs/expr.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/rewrite/NormalizeExprsRule.java
A testdata/workloads/functional-query/queries/QueryTest/parquet_stats.test
M tests/query_test/test_insert_parquet.py
15 files changed, 589 insertions(+), 7 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/32/6032/1
-- 
To view, visit http://gerrit.cloudera.org:8080/6032
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker