[Impala-ASF-CR] IMPALA-2328: Read support for min/max Parquet statistics
Lars Volker has posted comments on this change. Change subject: IMPALA-2328: Read support for min/max Parquet statistics .. Patch Set 11: (6 comments) Thank you Alex for the review. I submitted a change to address your comments here: https://gerrit.cloudera.org/#/c/6147/ Let's continue the review there. http://gerrit.cloudera.org:8080/#/c/6032/11/be/src/exec/hdfs-parquet-scanner.cc File be/src/exec/hdfs-parquet-scanner.cc: Line 194: // Allocate tuple buffer to evaluate conjuncts on parquet::Statistics. > Sorry to keep asking you about more JIRAs :), but you could file a JIRA to Sure thing, IMPALA-4986 Line 196: if (min_max_tuple_desc) { > != nullptr? Done, do we do that by convention? Line 496: int col_idx = slot_desc->col_pos() - scan_node_->num_partition_keys(); > I think we need to handle the name-based field solution policy here as well Done Line 497: DCHECK(col_idx < row_group.columns.size()); > I don't think this DCHECK is right. A Parquet file may legitimately have fe Using the schema resolver should catch this case now, so I kept the DCHECK. http://gerrit.cloudera.org:8080/#/c/6032/11/be/src/exec/parquet-metadata-utils.cc File be/src/exec/parquet-metadata-utils.cc: Line 223: DCHECK(col_idx < row_group.columns.size()); > I don't think this DCHECK is right. A Parquet file may legitimately not hav With the changes in the caller I think it is correct now. Within the scope of this method it seems reasonable to me to expect the caller to make this guarantee. http://gerrit.cloudera.org:8080/#/c/6032/11/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: Line 340: if (!(binaryPred.getChild(0) instanceof SlotRef)) continue; > Why not allow cast SlotRefs? Added support for implicit casts. -- To view, visit http://gerrit.cloudera.org:8080/6032 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4 Gerrit-PatchSet: 11 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Matthew Mulder Gerrit-Reviewer: Mostafa Mokhtar Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2328: Read support for min/max Parquet statistics
Alex Behm has posted comments on this change. Change subject: IMPALA-2328: Read support for min/max Parquet statistics .. Patch Set 11: (1 comment) http://gerrit.cloudera.org:8080/#/c/6032/11/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: Line 315: SlotRef slot = new SlotRef(slotDesc); > you're printing the original predicate, not the constructed one. I prefer to see in the explain plan what's actually being executed (and not a projection thereof). But works for me in this case. -- To view, visit http://gerrit.cloudera.org:8080/6032 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4 Gerrit-PatchSet: 11 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Matthew Mulder Gerrit-Reviewer: Mostafa Mokhtar Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2328: Read support for min/max Parquet statistics
Marcel Kornacker has posted comments on this change. Change subject: IMPALA-2328: Read support for min/max Parquet statistics .. Patch Set 11: (1 comment) http://gerrit.cloudera.org:8080/#/c/6032/11/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: Line 315: SlotRef slot = new SlotRef(slotDesc); > Would look nicer if the explain plan printed the predicates evaluated again you're printing the original predicate, not the constructed one. i prefer that, because it's much easier to relate back to the original query ('=' would show up as two predicates otherwise), and the actual implementation isn't very meaningful to the user. -- To view, visit http://gerrit.cloudera.org:8080/6032 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4 Gerrit-PatchSet: 11 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Matthew Mulder Gerrit-Reviewer: Mostafa Mokhtar Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2328: Read support for min/max Parquet statistics
Alex Behm has posted comments on this change. Change subject: IMPALA-2328: Read support for min/max Parquet statistics .. Patch Set 11: (7 comments) http://gerrit.cloudera.org:8080/#/c/6032/11/be/src/exec/hdfs-parquet-scanner.cc File be/src/exec/hdfs-parquet-scanner.cc: Line 194: // Allocate tuple buffer to evaluate conjuncts on parquet::Statistics. Sorry to keep asking you about more JIRAs :), but you could file a JIRA to evaluate min/max aggregates against Parquet stats? Line 196: if (min_max_tuple_desc) { != nullptr? Line 496: int col_idx = slot_desc->col_pos() - scan_node_->num_partition_keys(); I think we need to handle the name-based field solution policy here as well. SET PARQUET_FALLBACK_SCHEMA_RESOLUTION=NAME Line 497: DCHECK(col_idx < row_group.columns.size()); I don't think this DCHECK is right. A Parquet file may legitimately have fewer columns than the table schema. In these cases we can easily skip the row group because we'd initialize the missing slot values to NULL. http://gerrit.cloudera.org:8080/#/c/6032/11/be/src/exec/parquet-metadata-utils.cc File be/src/exec/parquet-metadata-utils.cc: Line 223: DCHECK(col_idx < row_group.columns.size()); I don't think this DCHECK is right. A Parquet file may legitimately not have all columns specified in the table schema. Also, in these cases we can easily skip the row group because we'd initialize the missing slot values to NULL. http://gerrit.cloudera.org:8080/#/c/6032/11/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: Line 315: SlotRef slot = new SlotRef(slotDesc); Would look nicer if the explain plan printed the predicates evaluated against the statistics with "max(col)" or "min(col)". You can do this by giving the slotDesc a custom label with setLabel(). Line 340: if (!(binaryPred.getChild(0) instanceof SlotRef)) continue; Why not allow cast SlotRefs? -- To view, visit http://gerrit.cloudera.org:8080/6032 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4 Gerrit-PatchSet: 11 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Matthew Mulder Gerrit-Reviewer: Mostafa Mokhtar Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2328: Read support for min/max Parquet statistics
Impala Public Jenkins has posted comments on this change. Change subject: IMPALA-2328: Read support for min/max Parquet statistics .. Patch Set 10: Verified+1 -- To view, visit http://gerrit.cloudera.org:8080/6032 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4 Gerrit-PatchSet: 10 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Matthew Mulder Gerrit-Reviewer: Mostafa Mokhtar Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-2328: Read support for min/max Parquet statistics
Impala Public Jenkins has submitted this change and it was merged. Change subject: IMPALA-2328: Read support for min/max Parquet statistics .. IMPALA-2328: Read support for min/max Parquet statistics This change adds support for skipping row groups based on Parquet row group statistics. With this change we only support reading statistics from Parquet files for numerical types (bool, integer, floating point) and for simple predicates of the formsor , where is LT, LE, GE, GT, and EQ. Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4 Reviewed-on: http://gerrit.cloudera.org:8080/6032 Reviewed-by: Lars Volker Tested-by: Impala Public Jenkins --- M be/src/exec/CMakeLists.txt M be/src/exec/hdfs-parquet-scanner.cc M be/src/exec/hdfs-parquet-scanner.h M be/src/exec/hdfs-parquet-table-writer.cc M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/hdfs-scan-node-base.h A be/src/exec/parquet-column-stats.cc M be/src/exec/parquet-column-stats.h A be/src/exec/parquet-column-stats.inline.h M be/src/exec/parquet-metadata-utils.cc M be/src/exec/parquet-metadata-utils.h M be/src/exprs/expr.h M common/thrift/PlanNodes.thrift M fe/src/main/java/org/apache/impala/analysis/AnalysisContext.java M fe/src/main/java/org/apache/impala/analysis/BinaryPredicate.java M fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/main/java/org/apache/impala/planner/KuduScanNode.java A fe/src/main/java/org/apache/impala/rewrite/NormalizeBinaryPredicatesRule.java M fe/src/test/java/org/apache/impala/analysis/ExprRewriteRulesTest.java M fe/src/test/java/org/apache/impala/planner/PlannerTest.java M testdata/workloads/functional-planner/queries/PlannerTest/aggregation.test M testdata/workloads/functional-planner/queries/PlannerTest/constant-folding.test M testdata/workloads/functional-planner/queries/PlannerTest/data-source-tables.test M testdata/workloads/functional-planner/queries/PlannerTest/hdfs.test M testdata/workloads/functional-planner/queries/PlannerTest/implicit-joins.test M testdata/workloads/functional-planner/queries/PlannerTest/kudu.test M testdata/workloads/functional-planner/queries/PlannerTest/mt-dop-validation.test M testdata/workloads/functional-planner/queries/PlannerTest/runtime-filter-propagation.test M testdata/workloads/functional-planner/queries/PlannerTest/subquery-rewrite.test A testdata/workloads/functional-query/queries/QueryTest/parquet_stats.test M tests/query_test/test_insert_parquet.py 32 files changed, 979 insertions(+), 162 deletions(-) Approvals: Impala Public Jenkins: Verified Lars Volker: Looks good to me, approved -- To view, visit http://gerrit.cloudera.org:8080/6032 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: merged Gerrit-Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4 Gerrit-PatchSet: 11 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Matthew Mulder Gerrit-Reviewer: Mostafa Mokhtar
[Impala-ASF-CR] IMPALA-2328: Read support for min/max Parquet statistics
Lars Volker has posted comments on this change. Change subject: IMPALA-2328: Read support for min/max Parquet statistics .. Patch Set 10: Code-Review+2 PS10 is a rebase. Carrying Marcel's +2. -- To view, visit http://gerrit.cloudera.org:8080/6032 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4 Gerrit-PatchSet: 10 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Matthew Mulder Gerrit-Reviewer: Mostafa Mokhtar Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-2328: Read support for min/max Parquet statistics
Impala Public Jenkins has posted comments on this change. Change subject: IMPALA-2328: Read support for min/max Parquet statistics .. Patch Set 10: Build started: http://jenkins.impala.io:8080/job/gerrit-verify-dryrun/295/ -- To view, visit http://gerrit.cloudera.org:8080/6032 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4 Gerrit-PatchSet: 10 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Matthew Mulder Gerrit-Reviewer: Mostafa Mokhtar Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-2328: Read support for min/max Parquet statistics
Lars Volker has posted comments on this change. Change subject: IMPALA-2328: Read support for min/max Parquet statistics .. Patch Set 9: PS9 addresses a clang-tidy error. -- To view, visit http://gerrit.cloudera.org:8080/6032 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4 Gerrit-PatchSet: 9 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Matthew Mulder Gerrit-Reviewer: Mostafa Mokhtar Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-2328: Read support for min/max Parquet statistics
Hello Marcel Kornacker, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/6032 to look at the new patch set (#9). Change subject: IMPALA-2328: Read support for min/max Parquet statistics .. IMPALA-2328: Read support for min/max Parquet statistics This change adds support for skipping row groups based on Parquet row group statistics. With this change we only support reading statistics from Parquet files for numerical types (bool, integer, floating point) and for simple predicates of the formsor , where is LT, LE, GE, GT, and EQ. Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4 --- M be/src/exec/CMakeLists.txt M be/src/exec/hdfs-parquet-scanner.cc M be/src/exec/hdfs-parquet-scanner.h M be/src/exec/hdfs-parquet-table-writer.cc M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/hdfs-scan-node-base.h A be/src/exec/parquet-column-stats.cc M be/src/exec/parquet-column-stats.h A be/src/exec/parquet-column-stats.inline.h M be/src/exec/parquet-metadata-utils.cc M be/src/exec/parquet-metadata-utils.h M be/src/exprs/expr.h M common/thrift/PlanNodes.thrift M fe/src/main/java/org/apache/impala/analysis/AnalysisContext.java M fe/src/main/java/org/apache/impala/analysis/BinaryPredicate.java M fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/main/java/org/apache/impala/planner/KuduScanNode.java A fe/src/main/java/org/apache/impala/rewrite/NormalizeBinaryPredicatesRule.java M fe/src/test/java/org/apache/impala/analysis/ExprRewriteRulesTest.java M fe/src/test/java/org/apache/impala/planner/PlannerTest.java M testdata/workloads/functional-planner/queries/PlannerTest/aggregation.test M testdata/workloads/functional-planner/queries/PlannerTest/constant-folding.test M testdata/workloads/functional-planner/queries/PlannerTest/data-source-tables.test M testdata/workloads/functional-planner/queries/PlannerTest/hdfs.test M testdata/workloads/functional-planner/queries/PlannerTest/implicit-joins.test M testdata/workloads/functional-planner/queries/PlannerTest/kudu.test M testdata/workloads/functional-planner/queries/PlannerTest/mt-dop-validation.test M testdata/workloads/functional-planner/queries/PlannerTest/runtime-filter-propagation.test M testdata/workloads/functional-planner/queries/PlannerTest/subquery-rewrite.test A testdata/workloads/functional-query/queries/QueryTest/parquet_stats.test M tests/query_test/test_insert_parquet.py 32 files changed, 979 insertions(+), 162 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/32/6032/9 -- To view, visit http://gerrit.cloudera.org:8080/6032 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4 Gerrit-PatchSet: 9 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Matthew Mulder Gerrit-Reviewer: Mostafa Mokhtar
[Impala-ASF-CR] IMPALA-2328: Read support for min/max Parquet statistics
Impala Public Jenkins has posted comments on this change. Change subject: IMPALA-2328: Read support for min/max Parquet statistics .. Patch Set 8: Verified-1 Build failed: http://jenkins.impala.io:8080/job/gerrit-verify-dryrun/294/ -- To view, visit http://gerrit.cloudera.org:8080/6032 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4 Gerrit-PatchSet: 8 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Matthew Mulder Gerrit-Reviewer: Mostafa Mokhtar Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-2328: Read support for min/max Parquet statistics
Impala Public Jenkins has posted comments on this change. Change subject: IMPALA-2328: Read support for min/max Parquet statistics .. Patch Set 8: Build started: http://jenkins.impala.io:8080/job/gerrit-verify-dryrun/294/ -- To view, visit http://gerrit.cloudera.org:8080/6032 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4 Gerrit-PatchSet: 8 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Matthew Mulder Gerrit-Reviewer: Mostafa Mokhtar Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-2328: Read support for min/max Parquet statistics
Marcel Kornacker has posted comments on this change. Change subject: IMPALA-2328: Read support for min/max Parquet statistics .. Patch Set 8: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/6032 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4 Gerrit-PatchSet: 8 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Matthew Mulder Gerrit-Reviewer: Mostafa Mokhtar Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-2328: Read support for min/max Parquet statistics
Lars Volker has posted comments on this change. Change subject: IMPALA-2328: Read support for min/max Parquet statistics .. Patch Set 4: (1 comment) Thanks for the review. I addressed the remaining comments in PS8. There's the question who calls ColumnStatsBase::ReadFromThrift, let me know if you want me to elaborate on it in the comment. http://gerrit.cloudera.org:8080/#/c/6032/7/fe/src/test/java/org/apache/impala/planner/PlannerTest.java File fe/src/test/java/org/apache/impala/planner/PlannerTest.java: Line 55: } > i thought the stats predicates are visible in extended? Yes, done. -- To view, visit http://gerrit.cloudera.org:8080/6032 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4 Gerrit-PatchSet: 4 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Matthew Mulder Gerrit-Reviewer: Mostafa Mokhtar Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2328: Read support for min/max Parquet statistics
Hello Marcel Kornacker, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/6032 to look at the new patch set (#8). Change subject: IMPALA-2328: Read support for min/max Parquet statistics .. IMPALA-2328: Read support for min/max Parquet statistics This change adds support for skipping row groups based on Parquet row group statistics. With this change we only support reading statistics from Parquet files for numerical types (bool, integer, floating point) and for simple predicates of the formsor , where is LT, LE, GE, GT, and EQ. Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4 --- M be/src/exec/CMakeLists.txt M be/src/exec/hdfs-parquet-scanner.cc M be/src/exec/hdfs-parquet-scanner.h M be/src/exec/hdfs-parquet-table-writer.cc M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/hdfs-scan-node-base.h A be/src/exec/parquet-column-stats.cc M be/src/exec/parquet-column-stats.h A be/src/exec/parquet-column-stats.inline.h M be/src/exec/parquet-metadata-utils.cc M be/src/exec/parquet-metadata-utils.h M be/src/exprs/expr.h M common/thrift/PlanNodes.thrift M fe/src/main/java/org/apache/impala/analysis/AnalysisContext.java M fe/src/main/java/org/apache/impala/analysis/BinaryPredicate.java M fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/main/java/org/apache/impala/planner/KuduScanNode.java A fe/src/main/java/org/apache/impala/rewrite/NormalizeBinaryPredicatesRule.java M fe/src/test/java/org/apache/impala/analysis/ExprRewriteRulesTest.java M fe/src/test/java/org/apache/impala/planner/PlannerTest.java M testdata/workloads/functional-planner/queries/PlannerTest/aggregation.test M testdata/workloads/functional-planner/queries/PlannerTest/constant-folding.test M testdata/workloads/functional-planner/queries/PlannerTest/data-source-tables.test M testdata/workloads/functional-planner/queries/PlannerTest/hdfs.test M testdata/workloads/functional-planner/queries/PlannerTest/implicit-joins.test M testdata/workloads/functional-planner/queries/PlannerTest/kudu.test M testdata/workloads/functional-planner/queries/PlannerTest/mt-dop-validation.test M testdata/workloads/functional-planner/queries/PlannerTest/runtime-filter-propagation.test M testdata/workloads/functional-planner/queries/PlannerTest/subquery-rewrite.test A testdata/workloads/functional-query/queries/QueryTest/parquet_stats.test M tests/query_test/test_insert_parquet.py 32 files changed, 986 insertions(+), 162 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/32/6032/8 -- To view, visit http://gerrit.cloudera.org:8080/6032 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4 Gerrit-PatchSet: 8 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Matthew Mulder Gerrit-Reviewer: Mostafa Mokhtar
[Impala-ASF-CR] IMPALA-2328: Read support for min/max Parquet statistics
Lars Volker has posted comments on this change. Change subject: IMPALA-2328: Read support for min/max Parquet statistics .. Patch Set 7: (2 comments) http://gerrit.cloudera.org:8080/#/c/6032/6/be/src/exec/parquet-column-stats.h File be/src/exec/parquet-column-stats.h: Line 31: /// writing parquet files. It provides an interface to populate a parquet::Statistics > i think you can leave the getters out for now, since we won't be calling th Done http://gerrit.cloudera.org:8080/#/c/6032/7/be/src/exec/parquet-column-stats.h File be/src/exec/parquet-column-stats.h: Line 68: static bool ReadFromThrift(const parquet::Statistics& thrift_stats, > who calls this? This is called by the parquet scanners. It's implemented by dispatching on 'col_type' and calling ReadFromThrift() in the subclass for the correct C++ type T. Should this be explained in the comment? -- To view, visit http://gerrit.cloudera.org:8080/6032 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4 Gerrit-PatchSet: 7 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Matthew Mulder Gerrit-Reviewer: Mostafa Mokhtar Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2328: Read support for min/max Parquet statistics
Marcel Kornacker has posted comments on this change. Change subject: IMPALA-2328: Read support for min/max Parquet statistics .. Patch Set 7: Code-Review+2 (3 comments) http://gerrit.cloudera.org:8080/#/c/6032/6/be/src/exec/parquet-column-stats.h File be/src/exec/parquet-column-stats.h: Line 31: /// writing parquet files. It provides an interface to populate a parquet::Statistics > The other two values are i64 null_count and i64 distinct_count, for which I i think you can leave the getters out for now, since we won't be calling them, but for the sake of completeness we should add the other two enums. http://gerrit.cloudera.org:8080/#/c/6032/7/be/src/exec/parquet-column-stats.h File be/src/exec/parquet-column-stats.h: Line 68: static bool ReadFromThrift(const parquet::Statistics& thrift_stats, who calls this? http://gerrit.cloudera.org:8080/#/c/6032/7/fe/src/test/java/org/apache/impala/planner/PlannerTest.java File fe/src/test/java/org/apache/impala/planner/PlannerTest.java: Line 55: options.setExplain_level(TExplainLevel.VERBOSE); i thought the stats predicates are visible in extended? -- To view, visit http://gerrit.cloudera.org:8080/6032 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4 Gerrit-PatchSet: 7 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Matthew Mulder Gerrit-Reviewer: Mostafa Mokhtar Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2328: Read support for min/max Parquet statistics
Lars Volker has posted comments on this change. Change subject: IMPALA-2328: Read support for min/max Parquet statistics .. Patch Set 6: (14 comments) Thanks for the review. Please see PS7. http://gerrit.cloudera.org:8080/#/c/6032/6/be/src/exec/hdfs-parquet-scanner.h File be/src/exec/hdfs-parquet-scanner.h: Line 382: /// Min/max statistics contexts, owned by instances of this class. > they are not owned by the class instances themselves, they're allocated in Done Line 386: /// 'min_max_conjunct_ctxs_', which are used when evaluating parquet::Statistics. > somewhat incomprehensible comment. Done Line 476: Status EvaluateRowGroupStats(const parquet::RowGroup& row_group, bool* skip_row_group); > you're not evaluating stats, you're evaluating predicates on the stats Done http://gerrit.cloudera.org:8080/#/c/6032/6/be/src/exec/parquet-column-stats.h File be/src/exec/parquet-column-stats.h: Line 31: enum class StatsField { MIN, MAX }; > move into ColumnStatsBase so it's clear what stats this is referring to. The other two values are i64 null_count and i64 distinct_count, for which I think we should have proper getters, i.e. we don't need to consider their type and method of decoding when reading them. Should we prefer an enum-based getter for those over ColumnStatsBase::ReadNullCountFromThrift() and ReadDistinctCountFromThrift()? Line 114: static bool ReadFromThrift(const parquet::Statistics& thrift_stats, const StatsField& stats_field, > move function bodies into parquet-column-stats-inl.h Done http://gerrit.cloudera.org:8080/#/c/6032/4/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: Line 297: collectionConjuncts_.clear(); > true, but those should be identical, given that we rewrite constants as lit Constant rewriting is controlled by the enable_expr_rewrites query option, changing the code to filter on literals and turning the query option off disables statistics for predicates like "a < 3 - 3". There is a test for this in parquet_stats.test:L206 that would fail. Let me know if you prefer to switch to literals. http://gerrit.cloudera.org:8080/#/c/6032/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: Line 85: * conjuncts, that are passed to the backend and will be evaluated against the > "conjuncts that" Done Line 154: private List minMaxPlanConjuncts_ = Lists.newArrayList(); > "plan conjuncts" doesn't really mean anything. original conjuncts? Done Line 302:* Builds a predicate to evaluate parquet::Statistics by copying 'inputSlot' into > "to evaluate against" Done Line 347: // Only constant exprs can be evaluated against the parquet::Statistics. This > drop "the" Done Line 349: if (!constExpr.isConstant()) continue; > as pointed out before, you shouldn't see non-literal constants here See my reply in PS4, constant folding depends on the enable_expr_rewrites query option. http://gerrit.cloudera.org:8080/#/c/6032/4/testdata/workloads/functional-planner/queries/PlannerTest/aggregation.test File testdata/workloads/functional-planner/queries/PlannerTest/aggregation.test: Line 1047:runtime filters: RF000 -> c_custkey > extended is fine. Done http://gerrit.cloudera.org:8080/#/c/6032/6/testdata/workloads/functional-planner/queries/PlannerTest/constant-folding.test File testdata/workloads/functional-planner/queries/PlannerTest/constant-folding.test: Line 1: # Test HDFS scan node. > run this file in extended mode so you see the stats predicates. Done http://gerrit.cloudera.org:8080/#/c/6032/6/testdata/workloads/functional-query/queries/QueryTest/parquet_stats.test File testdata/workloads/functional-query/queries/QueryTest/parquet_stats.test: Line 206: # Test that without expr rewrite and thus without constant folding, constant exprs still > what's the point of those tests? See my comment regarding literal vs. constant exprs in PS4. These tests make sure that we don't break support for constant exprs. -- To view, visit http://gerrit.cloudera.org:8080/6032 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4 Gerrit-PatchSet: 6 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Matthew Mulder Gerrit-Reviewer: Mostafa Mokhtar Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2328: Read support for min/max Parquet statistics
Lars Volker has uploaded a new patch set (#7). Change subject: IMPALA-2328: Read support for min/max Parquet statistics .. IMPALA-2328: Read support for min/max Parquet statistics This change adds support for skipping row groups based on Parquet row group statistics. With this change we only support reading statistics from Parquet files for numerical types (bool, integer, floating point) and for simple predicates of the formsor , where is LT, LE, GE, GT, and EQ. Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4 --- M be/src/exec/CMakeLists.txt M be/src/exec/hdfs-parquet-scanner.cc M be/src/exec/hdfs-parquet-scanner.h M be/src/exec/hdfs-parquet-table-writer.cc M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/hdfs-scan-node-base.h A be/src/exec/parquet-column-stats.cc M be/src/exec/parquet-column-stats.h A be/src/exec/parquet-column-stats.inline.h M be/src/exec/parquet-metadata-utils.cc M be/src/exec/parquet-metadata-utils.h M be/src/exprs/expr.h M common/thrift/PlanNodes.thrift M fe/src/main/java/org/apache/impala/analysis/AnalysisContext.java M fe/src/main/java/org/apache/impala/analysis/BinaryPredicate.java M fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/main/java/org/apache/impala/planner/KuduScanNode.java A fe/src/main/java/org/apache/impala/rewrite/NormalizeBinaryPredicatesRule.java M fe/src/test/java/org/apache/impala/analysis/ExprRewriteRulesTest.java M fe/src/test/java/org/apache/impala/planner/PlannerTest.java M testdata/workloads/functional-planner/queries/PlannerTest/aggregation.test M testdata/workloads/functional-planner/queries/PlannerTest/constant-folding.test M testdata/workloads/functional-planner/queries/PlannerTest/data-source-tables.test M testdata/workloads/functional-planner/queries/PlannerTest/hdfs.test M testdata/workloads/functional-planner/queries/PlannerTest/implicit-joins.test M testdata/workloads/functional-planner/queries/PlannerTest/kudu.test M testdata/workloads/functional-planner/queries/PlannerTest/mt-dop-validation.test M testdata/workloads/functional-planner/queries/PlannerTest/runtime-filter-propagation.test M testdata/workloads/functional-planner/queries/PlannerTest/subquery-rewrite.test A testdata/workloads/functional-query/queries/QueryTest/parquet_stats.test M tests/query_test/test_insert_parquet.py 32 files changed, 1,121 insertions(+), 283 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/32/6032/7 -- To view, visit http://gerrit.cloudera.org:8080/6032 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4 Gerrit-PatchSet: 7 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Matthew Mulder Gerrit-Reviewer: Mostafa Mokhtar
[Impala-ASF-CR] IMPALA-2328: Read support for min/max Parquet statistics
Marcel Kornacker has posted comments on this change. Change subject: IMPALA-2328: Read support for min/max Parquet statistics .. Patch Set 6: (14 comments) http://gerrit.cloudera.org:8080/#/c/6032/6/be/src/exec/hdfs-parquet-scanner.h File be/src/exec/hdfs-parquet-scanner.h: Line 382: /// Min/max statistics contexts, owned by instances of this class. they are not owned by the class instances themselves, they're allocated in state_->obj_pool() Line 386: /// 'min_max_conjunct_ctxs_', which are used when evaluating parquet::Statistics. somewhat incomprehensible comment. this is used only for a specific function, and then only to avoid the dynamic allocation overhead. point this out, right now it seems very mysterious. Line 476: Status EvaluateRowGroupStats(const parquet::RowGroup& row_group, bool* skip_row_group); you're not evaluating stats, you're evaluating predicates on the stats http://gerrit.cloudera.org:8080/#/c/6032/6/be/src/exec/parquet-column-stats.h File be/src/exec/parquet-column-stats.h: Line 31: enum class StatsField { MIN, MAX }; move into ColumnStatsBase so it's clear what stats this is referring to. also, add enums for all fields in Statistics. Line 114: static bool ReadFromThrift(const parquet::Statistics& thrift_stats, const StatsField& stats_field, move function bodies into parquet-column-stats-inl.h http://gerrit.cloudera.org:8080/#/c/6032/4/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: Line 297: collectionConjuncts_.clear(); > My reasoning was that any constant expression could be evaluated against th true, but those should be identical, given that we rewrite constants as literals. http://gerrit.cloudera.org:8080/#/c/6032/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: Line 85: * conjuncts, that are passed to the backend and will be evaluated against the "conjuncts that" Line 154: private List minMaxPlanConjuncts_ = Lists.newArrayList(); "plan conjuncts" doesn't really mean anything. original conjuncts? Line 302:* Builds a predicate to evaluate parquet::Statistics by copying 'inputSlot' into "to evaluate against" Line 347: // Only constant exprs can be evaluated against the parquet::Statistics. This drop "the" Line 349: if (!constExpr.isConstant()) continue; as pointed out before, you shouldn't see non-literal constants here http://gerrit.cloudera.org:8080/#/c/6032/4/testdata/workloads/functional-planner/queries/PlannerTest/aggregation.test File testdata/workloads/functional-planner/queries/PlannerTest/aggregation.test: Line 1047:runtime filters: RF000 -> c_custkey > I changed the code to track the original predicates and display them instea extended is fine. yes, "parquet statistics" is more concrete. http://gerrit.cloudera.org:8080/#/c/6032/6/testdata/workloads/functional-planner/queries/PlannerTest/constant-folding.test File testdata/workloads/functional-planner/queries/PlannerTest/constant-folding.test: Line 1: # Test HDFS scan node. run this file in extended mode so you see the stats predicates. http://gerrit.cloudera.org:8080/#/c/6032/6/testdata/workloads/functional-query/queries/QueryTest/parquet_stats.test File testdata/workloads/functional-query/queries/QueryTest/parquet_stats.test: Line 206: # Test that without expr rewrite and thus without constant folding, constant exprs still what's the point of those tests? -- To view, visit http://gerrit.cloudera.org:8080/6032 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4 Gerrit-PatchSet: 6 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Matthew Mulder Gerrit-Reviewer: Mostafa Mokhtar Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2328: Read support for min/max Parquet statistics
Lars Volker has posted comments on this change. Change subject: IMPALA-2328: Read support for min/max Parquet statistics .. Patch Set 4: (1 comment) http://gerrit.cloudera.org:8080/#/c/6032/4/testdata/workloads/functional-query/queries/QueryTest/parquet_stats.test File testdata/workloads/functional-query/queries/QueryTest/parquet_stats.test: Line 2: QUERY > Can you add coverage for filtering on the root level for nested data? Can you elaborate on how to do this? Wouldn't it only make sense to apply filtering on the leave level, since objects at the root level would be complex ones? Can you post a sample query and the result you would expect? Thanks. -- To view, visit http://gerrit.cloudera.org:8080/6032 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4 Gerrit-PatchSet: 4 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Matthew Mulder Gerrit-Reviewer: Mostafa Mokhtar Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2328: Read support for min/max Parquet statistics
Lars Volker has uploaded a new patch set (#6). Change subject: IMPALA-2328: Read support for min/max Parquet statistics .. IMPALA-2328: Read support for min/max Parquet statistics This change adds support for skipping row groups based on Parquet row group statistics. With this change we only support reading statistics from Parquet files for numerical types (bool, integer, floating point) and for simple predicates of the formsor , where is LT, LE, GE, GT, and EQ. Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4 --- M be/src/exec/CMakeLists.txt M be/src/exec/hdfs-parquet-scanner.cc M be/src/exec/hdfs-parquet-scanner.h M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/hdfs-scan-node-base.h A be/src/exec/parquet-column-stats.cc M be/src/exec/parquet-column-stats.h M be/src/exec/parquet-metadata-utils.cc M be/src/exec/parquet-metadata-utils.h M be/src/exprs/expr.h M common/thrift/PlanNodes.thrift M fe/src/main/java/org/apache/impala/analysis/AnalysisContext.java M fe/src/main/java/org/apache/impala/analysis/BinaryPredicate.java M fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/main/java/org/apache/impala/planner/KuduScanNode.java A fe/src/main/java/org/apache/impala/rewrite/NormalizeBinaryPredicatesRule.java M fe/src/test/java/org/apache/impala/analysis/ExprRewriteRulesTest.java M testdata/workloads/functional-planner/queries/PlannerTest/aggregation.test M testdata/workloads/functional-planner/queries/PlannerTest/constant-folding.test M testdata/workloads/functional-planner/queries/PlannerTest/data-source-tables.test M testdata/workloads/functional-planner/queries/PlannerTest/hdfs.test M testdata/workloads/functional-planner/queries/PlannerTest/implicit-joins.test M testdata/workloads/functional-planner/queries/PlannerTest/kudu.test M testdata/workloads/functional-planner/queries/PlannerTest/mt-dop-validation.test M testdata/workloads/functional-planner/queries/PlannerTest/runtime-filter-propagation.test M testdata/workloads/functional-planner/queries/PlannerTest/subquery-rewrite.test A testdata/workloads/functional-query/queries/QueryTest/parquet_stats.test M tests/query_test/test_insert_parquet.py 29 files changed, 756 insertions(+), 87 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/32/6032/6 -- To view, visit http://gerrit.cloudera.org:8080/6032 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4 Gerrit-PatchSet: 6 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Matthew Mulder Gerrit-Reviewer: Mostafa Mokhtar
[Impala-ASF-CR] IMPALA-2328: Read support for min/max Parquet statistics
Lars Volker has posted comments on this change. Change subject: IMPALA-2328: Read support for min/max Parquet statistics .. Patch Set 4: (26 comments) Thanks for the review. Please see PS6. http://gerrit.cloudera.org:8080/#/c/6032/4/be/src/exec/hdfs-parquet-scanner.cc File be/src/exec/hdfs-parquet-scanner.cc: Line 472: ScopedBuffer tuple_buffer(scan_node_->mem_tracker()); > no need to allocate this and go through the setup for every single row grou Done Line 486: conjuncts_with_stats.reserve(min_max_conjuncts_ctxs_.size()); > there's too much memory allocation going on for what are essentially static Done Line 496: unique_ptr stats( > there's no need for memory allocation here Done Line 501: if (e->function_name() == "lt" || e->function_name() == "le") { > introduce named constants (in the appropriate class) Other code that compares function_name follows the same pattern. As discussed in person, let's keep this for now and see if we find a nicer abstraction to determine when to use min/max values, e.g. monotonically bound predicates. http://gerrit.cloudera.org:8080/#/c/6032/4/be/src/exec/hdfs-parquet-scanner.h File be/src/exec/hdfs-parquet-scanner.h: Line 378: /// Min/max statistics contexts. > who owns them? Done http://gerrit.cloudera.org:8080/#/c/6032/4/be/src/exec/hdfs-scan-node-base.cc File be/src/exec/hdfs-scan-node-base.cc: Line 182: if (min_max_tuple_id_ > -1) { > ">" or < don't make sense for tuple ids Done http://gerrit.cloudera.org:8080/#/c/6032/4/be/src/exec/parquet-column-stats.h File be/src/exec/parquet-column-stats.h: Line 40: /// writing parquet files. It provides an interface to populate a parquet::Statistics > this description isn't true anymore, it's now also an intermediate staging I removed the expanded abstraction and added a sentence to the comment explaining that it can also be used to decode stats. Line 70: /// 'nullptr' for unsupported types. The caller is responsible for freeing the memory. > this should be created in an objectpool Code has been removed. Line 87: virtual void WriteMinToBuffer(void* buffer) const = 0; > remove "tobuffer", that's clear from the parameters. what's unclear is the Done Line 126: max_value_(max_value) { > formatting Done http://gerrit.cloudera.org:8080/#/c/6032/4/common/thrift/PlanNodes.thrift File common/thrift/PlanNodes.thrift: Line 208: // Conjuncts that can be evaluated against the statistics of parquet files. Each > refer to thrift structs explicitly, where appropriate. Done Line 209: // conjunct references the slot in 'min_max_tuple_id' with the same index. > i didn't understand the last sentence. Changed it. http://gerrit.cloudera.org:8080/#/c/6032/4/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: Line 143: // List of conjuncts for min/max values, that are used to skip data when scanning > "min/max values" is vague, refer to thrift structs explicitly. Done Line 145: private List minMaxConjunts_ = Lists.newArrayList(); > conjuncts Done Line 147: // Tuple that is used to materialize statistics when scanning Parquet files. > describe what's in it. Done Line 290:* Builds a predicate to evaluate statistical values by combining 'inputSlot', > make 'statistical values' concrete Done Line 291:* 'inputPred' and 'op' into a new predicate, and adding it to 'minMaxConjunts_'. > missing other side effects Done Line 297: Preconditions.checkState(constExpr.isConstant()); > why not a literal? My reasoning was that any constant expression could be evaluated against the statistics. I added a test to make sure this does indeed work as expected. Line 303: // Make a new slot from the slot descriptor. > what do you mean by 'slot'? (slot/slot descriptor are usually synonyms) I meant slotRef, but the comment looked superfluous and I removed it altogether. Line 315:* TODO: This method currently doesn't de-dup slots that are referenced multiple times. > resolve As discussed, lets keep this for now and address it in a subsequent change by simplifying redundant exprs. I created IMPALA-4958 to track this. Line 328: // We only support slot refs on the left hand side of the predicate, a rewriting > this overlaps with what kuduscannode does, and there's already a similar co I had talked to Alex about this last week and he suggested to add a general rewrite rule in the FE to normalize binary exprs. I removed the branches from BinaryPredicate.java that are no longer needed after the normalization. I also cleaned up code in KuduScanNode.java that is no longer needed. Line 343: { > { on preceding line Done Line 658: msg.hdfs_scan_node.setMin_max_tuple_id(minMaxTuple_.getId().asInt()); > check that it's not invalid How can I check this? The tupleId needs to come from the TupleId generator a
[Impala-ASF-CR] IMPALA-2328: Read support for min/max Parquet statistics
Lars Volker has uploaded a new patch set (#5). Change subject: IMPALA-2328: Read support for min/max Parquet statistics .. IMPALA-2328: Read support for min/max Parquet statistics This change adds support for skipping row groups based on Parquet row group statistics. With this change we only support reading statistics from Parquet files for numerical types (bool, integer, floating point) and for simple predicates of the formsor , where is LT, LE, GE, GT, and EQ. Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4 --- M be/src/exec/CMakeLists.txt M be/src/exec/hdfs-parquet-scanner.cc M be/src/exec/hdfs-parquet-scanner.h M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/hdfs-scan-node-base.h A be/src/exec/parquet-column-stats.cc M be/src/exec/parquet-column-stats.h M be/src/exec/parquet-metadata-utils.cc M be/src/exec/parquet-metadata-utils.h M be/src/exprs/expr.h M common/thrift/PlanNodes.thrift M fe/src/main/java/org/apache/impala/analysis/AnalysisContext.java M fe/src/main/java/org/apache/impala/analysis/BinaryPredicate.java M fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/main/java/org/apache/impala/planner/KuduScanNode.java A fe/src/main/java/org/apache/impala/rewrite/NormalizeBinaryPredicatesRule.java M fe/src/test/java/org/apache/impala/analysis/ExprRewriteRulesTest.java M testdata/workloads/functional-planner/queries/PlannerTest/aggregation.test M testdata/workloads/functional-planner/queries/PlannerTest/constant-folding.test M testdata/workloads/functional-planner/queries/PlannerTest/data-source-tables.test M testdata/workloads/functional-planner/queries/PlannerTest/hdfs.test M testdata/workloads/functional-planner/queries/PlannerTest/implicit-joins.test M testdata/workloads/functional-planner/queries/PlannerTest/kudu.test M testdata/workloads/functional-planner/queries/PlannerTest/mt-dop-validation.test M testdata/workloads/functional-planner/queries/PlannerTest/runtime-filter-propagation.test M testdata/workloads/functional-planner/queries/PlannerTest/subquery-rewrite.test A testdata/workloads/functional-query/queries/QueryTest/parquet_stats.test M tests/query_test/test_insert_parquet.py 29 files changed, 757 insertions(+), 87 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/32/6032/5 -- To view, visit http://gerrit.cloudera.org:8080/6032 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4 Gerrit-PatchSet: 5 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Matthew Mulder Gerrit-Reviewer: Mostafa Mokhtar
[Impala-ASF-CR] IMPALA-2328: Read support for min/max Parquet statistics
Mostafa Mokhtar has posted comments on this change. Change subject: IMPALA-2328: Read support for min/max Parquet statistics .. Patch Set 4: (1 comment) http://gerrit.cloudera.org:8080/#/c/6032/4/testdata/workloads/functional-query/queries/QueryTest/parquet_stats.test File testdata/workloads/functional-query/queries/QueryTest/parquet_stats.test: Line 2: QUERY Can you add coverage for filtering on the root level for nested data? -- To view, visit http://gerrit.cloudera.org:8080/6032 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4 Gerrit-PatchSet: 4 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Matthew Mulder Gerrit-Reviewer: Mostafa Mokhtar Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2328: Read support for min/max Parquet statistics
Marcel Kornacker has posted comments on this change. Change subject: IMPALA-2328: Read support for min/max Parquet statistics .. Patch Set 4: (26 comments) http://gerrit.cloudera.org:8080/#/c/6032/4/be/src/exec/hdfs-parquet-scanner.cc File be/src/exec/hdfs-parquet-scanner.cc: Line 472: ScopedBuffer tuple_buffer(scan_node_->mem_tracker()); no need to allocate this and go through the setup for every single row group. Line 486: conjuncts_with_stats.reserve(min_max_conjuncts_ctxs_.size()); there's too much memory allocation going on for what are essentially static structures. Line 496: unique_ptr stats( there's no need for memory allocation here Line 501: if (e->function_name() == "lt" || e->function_name() == "le") { introduce named constants (in the appropriate class) http://gerrit.cloudera.org:8080/#/c/6032/4/be/src/exec/hdfs-parquet-scanner.h File be/src/exec/hdfs-parquet-scanner.h: Line 378: /// Min/max statistics contexts. who owns them? http://gerrit.cloudera.org:8080/#/c/6032/4/be/src/exec/hdfs-scan-node-base.cc File be/src/exec/hdfs-scan-node-base.cc: Line 182: if (min_max_tuple_id_ > -1) { ">" or < don't make sense for tuple ids http://gerrit.cloudera.org:8080/#/c/6032/4/be/src/exec/parquet-column-stats.h File be/src/exec/parquet-column-stats.h: Line 40: /// writing parquet files. It provides an interface to populate a parquet::Statistics this description isn't true anymore, it's now also an intermediate staging area for the min/max values. i'm not sure this expanded abstraction makes sense (why do you need an intermediate staging area?). Line 70: /// 'nullptr' for unsupported types. The caller is responsible for freeing the memory. this should be created in an objectpool Line 87: virtual void WriteMinToBuffer(void* buffer) const = 0; remove "tobuffer", that's clear from the parameters. what's unclear is the format of the written data. is the void* simply a slot? 'buffer' is too generic. Line 126: max_value_(max_value) { formatting it looks like you're populating an object of this class and then write out the min/max to the corresponding slots. why not go directly from parquet::statistics to the slots? http://gerrit.cloudera.org:8080/#/c/6032/4/common/thrift/PlanNodes.thrift File common/thrift/PlanNodes.thrift: Line 208: // Conjuncts that can be evaluated against the statistics of parquet files. Each refer to thrift structs explicitly, where appropriate. Line 209: // conjunct references the slot in 'min_max_tuple_id' with the same index. i didn't understand the last sentence. http://gerrit.cloudera.org:8080/#/c/6032/4/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: Line 143: // List of conjuncts for min/max values, that are used to skip data when scanning "min/max values" is vague, refer to thrift structs explicitly. Line 145: private List minMaxConjunts_ = Lists.newArrayList(); conjuncts Line 147: // Tuple that is used to materialize statistics when scanning Parquet files. describe what's in it. i should have an understanding of the overall approach just based on reading the comments for the related data structures. this might best fit into the class header. Line 290:* Builds a predicate to evaluate statistical values by combining 'inputSlot', make 'statistical values' concrete Line 291:* 'inputPred' and 'op' into a new predicate, and adding it to 'minMaxConjunts_'. missing other side effects Line 297: Preconditions.checkState(constExpr.isConstant()); why not a literal? Line 303: // Make a new slot from the slot descriptor. what do you mean by 'slot'? (slot/slot descriptor are usually synonyms) Line 315:* TODO: This method currently doesn't de-dup slots that are referenced multiple times. resolve Line 328: // We only support slot refs on the left hand side of the predicate, a rewriting this overlaps with what kuduscannode does, and there's already a similar concept in the form of binarypredicate.getboundslot. figure out the similarity (might require a new concept) and move the functionality into binarypredicate. Line 343: { { on preceding line Line 658: msg.hdfs_scan_node.setMin_max_tuple_id(minMaxTuple_.getId().asInt()); check that it's not invalid http://gerrit.cloudera.org:8080/#/c/6032/4/fe/src/main/java/org/apache/impala/rewrite/NormalizeBinaryPredicatesRule.java File fe/src/main/java/org/apache/impala/rewrite/NormalizeBinaryPredicatesRule.java: Line 35: public class NormalizeBinaryPredicatesRule implements ExprRewriteRule { add tests for rule Line 40: if (!expr.isAnalyzed()) return expr; why do you need this? http://gerrit.cloudera.org:8080/#/c/6032/4/testdata/workloads/functional-planner/queries/PlannerTest/aggregation.test File testdata/workloads/functional-planner/queries/PlannerTest/a
[Impala-ASF-CR] IMPALA-2328: Read support for min/max Parquet statistics
Lars Volker has posted comments on this change. Change subject: IMPALA-2328: Read support for min/max Parquet statistics .. Patch Set 4: PS4 adds the "statistics predicates" to the PlannerTest and removes the custom predicate normalization from KuduScanNode.java, after it is now done during query rewrite. -- To view, visit http://gerrit.cloudera.org:8080/6032 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4 Gerrit-PatchSet: 4 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Lars Volker Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-2328: Read support for min/max Parquet statistics
Lars Volker has uploaded a new patch set (#4). Change subject: IMPALA-2328: Read support for min/max Parquet statistics .. IMPALA-2328: Read support for min/max Parquet statistics This change adds support for skipping row groups based on Parquet row group statistics. With this change we only support reading statistics from Parquet files for numerical types (bool, integer, floating point) and for simple predicates of the formsor , where is LT, LE, GE, GT, and EQ. Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4 --- M be/src/exec/CMakeLists.txt M be/src/exec/hdfs-parquet-scanner.cc M be/src/exec/hdfs-parquet-scanner.h M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/hdfs-scan-node-base.h A be/src/exec/parquet-column-stats.cc M be/src/exec/parquet-column-stats.h M be/src/exec/parquet-metadata-utils.cc M be/src/exec/parquet-metadata-utils.h M be/src/exprs/expr.h M common/thrift/PlanNodes.thrift M fe/src/main/java/org/apache/impala/analysis/AnalysisContext.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/main/java/org/apache/impala/planner/KuduScanNode.java A fe/src/main/java/org/apache/impala/rewrite/NormalizeBinaryPredicatesRule.java M testdata/workloads/functional-planner/queries/PlannerTest/aggregation.test M testdata/workloads/functional-planner/queries/PlannerTest/constant-folding.test M testdata/workloads/functional-planner/queries/PlannerTest/data-source-tables.test M testdata/workloads/functional-planner/queries/PlannerTest/empty.test M testdata/workloads/functional-planner/queries/PlannerTest/kudu.test M testdata/workloads/functional-planner/queries/PlannerTest/mt-dop-validation.test M testdata/workloads/functional-planner/queries/PlannerTest/nested-collections.test M testdata/workloads/functional-planner/queries/PlannerTest/predicate-propagation.test M testdata/workloads/functional-planner/queries/PlannerTest/runtime-filter-propagation.test M testdata/workloads/functional-planner/queries/PlannerTest/tpch-nested.test A testdata/workloads/functional-query/queries/QueryTest/parquet_stats.test M tests/query_test/test_insert_parquet.py 27 files changed, 729 insertions(+), 57 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/32/6032/4 -- To view, visit http://gerrit.cloudera.org:8080/6032 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4 Gerrit-PatchSet: 4 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Lars Volker
[Impala-ASF-CR] IMPALA-2328: Read support for min/max Parquet statistics
Lars Volker has posted comments on this change. Change subject: IMPALA-2328: Read support for min/max Parquet statistics .. Patch Set 3: PS3 adds the statistics predicates to EXPLAIN output and adds tests for that. -- To view, visit http://gerrit.cloudera.org:8080/6032 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4 Gerrit-PatchSet: 3 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Lars Volker Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-2328: Read support for min/max Parquet statistics
Lars Volker has uploaded a new patch set (#3). Change subject: IMPALA-2328: Read support for min/max Parquet statistics .. IMPALA-2328: Read support for min/max Parquet statistics This change adds support for skipping row groups based on Parquet row group statistics. With this change we only support reading statistics from Parquet files for numerical types (bool, integer, floating point) and for simple predicates of the formsor , where is LT, LE, GE, GT, and EQ. Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4 --- M be/src/exec/CMakeLists.txt M be/src/exec/hdfs-parquet-scanner.cc M be/src/exec/hdfs-parquet-scanner.h M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/hdfs-scan-node-base.h A be/src/exec/parquet-column-stats.cc M be/src/exec/parquet-column-stats.h M be/src/exec/parquet-metadata-utils.cc M be/src/exec/parquet-metadata-utils.h M be/src/exprs/expr.h M common/thrift/PlanNodes.thrift M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/main/java/org/apache/impala/rewrite/NormalizeExprsRule.java A testdata/workloads/functional-query/queries/QueryTest/parquet_stats.test M tests/query_test/test_insert_parquet.py 15 files changed, 615 insertions(+), 7 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/32/6032/3 -- To view, visit http://gerrit.cloudera.org:8080/6032 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4 Gerrit-PatchSet: 3 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Lars Volker
[Impala-ASF-CR] IMPALA-2328: Read support for min/max Parquet statistics
Lars Volker has posted comments on this change. Change subject: IMPALA-2328: Read support for min/max Parquet statistics .. Patch Set 2: PS2 adds error handling when deserializing statistics -- To view, visit http://gerrit.cloudera.org:8080/6032 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker Gerrit-Reviewer: Lars Volker Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-2328: Read support for min/max Parquet statistics
Lars Volker has uploaded a new patch set (#2). Change subject: IMPALA-2328: Read support for min/max Parquet statistics .. IMPALA-2328: Read support for min/max Parquet statistics This change adds support for skipping row groups based on Parquet row group statistics. With this change we only support reading statistics from Parquet files for numerical types (bool, integer, floating point) and for simple predicates of the formsor , where is LT, LE, GE, GT, and EQ. Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4 --- M be/src/exec/CMakeLists.txt M be/src/exec/hdfs-parquet-scanner.cc M be/src/exec/hdfs-parquet-scanner.h M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/hdfs-scan-node-base.h A be/src/exec/parquet-column-stats.cc M be/src/exec/parquet-column-stats.h M be/src/exec/parquet-metadata-utils.cc M be/src/exec/parquet-metadata-utils.h M be/src/exprs/expr.h M common/thrift/PlanNodes.thrift M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/main/java/org/apache/impala/rewrite/NormalizeExprsRule.java A testdata/workloads/functional-query/queries/QueryTest/parquet_stats.test M tests/query_test/test_insert_parquet.py 15 files changed, 599 insertions(+), 7 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/32/6032/2 -- To view, visit http://gerrit.cloudera.org:8080/6032 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker
[Impala-ASF-CR] IMPALA-2328: Read support for min/max Parquet statistics
Lars Volker has uploaded a new change for review. http://gerrit.cloudera.org:8080/6032 Change subject: IMPALA-2328: Read support for min/max Parquet statistics .. IMPALA-2328: Read support for min/max Parquet statistics This change adds support for skipping row groups based on Parquet row group statistics. With this change we only support reading statistics from Parquet files for numerical types (bool, integer, floating point) and for simple predicates of the formsor , where is LT, LE, GE, GT, and EQ. Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4 --- M be/src/exec/CMakeLists.txt M be/src/exec/hdfs-parquet-scanner.cc M be/src/exec/hdfs-parquet-scanner.h M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/hdfs-scan-node-base.h A be/src/exec/parquet-column-stats.cc M be/src/exec/parquet-column-stats.h M be/src/exec/parquet-metadata-utils.cc M be/src/exec/parquet-metadata-utils.h M be/src/exprs/expr.h M common/thrift/PlanNodes.thrift M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/main/java/org/apache/impala/rewrite/NormalizeExprsRule.java A testdata/workloads/functional-query/queries/QueryTest/parquet_stats.test M tests/query_test/test_insert_parquet.py 15 files changed, 589 insertions(+), 7 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/32/6032/1 -- To view, visit http://gerrit.cloudera.org:8080/6032 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newchange Gerrit-Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4 Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker