Lars Volker has posted comments on this change.

Change subject: IMPALA-2328: Read support for min/max Parquet statistics
......................................................................
Patch Set 4: (26 comments)

Thanks for the review. Please see PS6.

http://gerrit.cloudera.org:8080/#/c/6032/4/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 472: ScopedBuffer tuple_buffer(scan_node_->mem_tracker());
> no need to allocate this and go through the setup for every single row grou
Done

Line 486: conjuncts_with_stats.reserve(min_max_conjuncts_ctxs_.size());
> there's too much memory allocation going on for what are essentially static
Done

Line 496: unique_ptr<ColumnStatsBase> stats(
> there's no need for memory allocation here
Done

Line 501: if (e->function_name() == "lt" || e->function_name() == "le") {
> introduce named constants (in the appropriate class)
Other code that compares function_name follows the same pattern. As discussed
in person, let's keep this for now and see if we find a nicer abstraction to
determine when to use min/max values, e.g. monotonically bound predicates.

http://gerrit.cloudera.org:8080/#/c/6032/4/be/src/exec/hdfs-parquet-scanner.h
File be/src/exec/hdfs-parquet-scanner.h:

Line 378: /// Min/max statistics contexts.
> who owns them?
Done

http://gerrit.cloudera.org:8080/#/c/6032/4/be/src/exec/hdfs-scan-node-base.cc
File be/src/exec/hdfs-scan-node-base.cc:

Line 182: if (min_max_tuple_id_ > -1) {
> ">" or < don't make sense for tuple ids
Done

http://gerrit.cloudera.org:8080/#/c/6032/4/be/src/exec/parquet-column-stats.h
File be/src/exec/parquet-column-stats.h:

Line 40: /// writing parquet files. It provides an interface to populate a parquet::Statistics
> this description isn't true anymore, it's now also an intermediate staging
I removed the expanded abstraction and added a sentence to the comment
explaining that it can also be used to decode stats.

Line 70: /// 'nullptr' for unsupported types. The caller is responsible for freeing the memory.
> this should be created in an objectpool
Code has been removed.
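The reply to Line 501 keeps the pattern of inspecting `function_name()` ("lt", "le", ...) to decide which statistic a conjunct should be compared against. As a rough illustration of that min/max pruning decision (hypothetical types and helper names, not the patch's actual code), a sketch in C++:

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical stand-in for a row group's decoded column statistics.
struct ColumnStats {
  int64_t min_value;
  int64_t max_value;
};

// Returns true if no row in the group can satisfy a predicate of the form
// <col> OP <constant>, i.e. the whole row group may be skipped.
bool CanSkipRowGroup(const std::string& fn_name, const ColumnStats& stats,
                     int64_t constant) {
  // "lt"/"le": predicate is col < c (or <=); only the group's minimum matters.
  if (fn_name == "lt") return !(stats.min_value < constant);
  if (fn_name == "le") return !(stats.min_value <= constant);
  // "gt"/"ge": predicate is col > c (or >=); only the maximum matters.
  if (fn_name == "gt") return !(stats.max_value > constant);
  if (fn_name == "ge") return !(stats.max_value >= constant);
  // "eq": skip only if the constant lies outside the [min, max] range.
  if (fn_name == "eq") {
    return constant < stats.min_value || constant > stats.max_value;
  }
  // Unsupported predicate: conservatively keep the row group.
  return false;
}
```

Note this is only sound for predicates whose truth is monotonically bounded by the column's min/max, which is why the reply mentions "monotonically bound predicates" as a possible future abstraction.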
Line 87: virtual void WriteMinToBuffer(void* buffer) const = 0;
> remove "tobuffer", that's clear from the parameters. what's unclear is the
Done

Line 126: max_value_(max_value) {
> formatting
Done

http://gerrit.cloudera.org:8080/#/c/6032/4/common/thrift/PlanNodes.thrift
File common/thrift/PlanNodes.thrift:

Line 208: // Conjuncts that can be evaluated against the statistics of parquet files. Each
> refer to thrift structs explicitly, where appropriate.
Done

Line 209: // conjunct references the slot in 'min_max_tuple_id' with the same index.
> i didn't understand the last sentence.
Changed it.

http://gerrit.cloudera.org:8080/#/c/6032/4/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

Line 143: // List of conjuncts for min/max values, that are used to skip data when scanning
> "min/max values" is vague, refer to thrift structs explicitly.
Done

Line 145: private List<Expr> minMaxConjunts_ = Lists.newArrayList();
> conjuncts
Done

Line 147: // Tuple that is used to materialize statistics when scanning Parquet files.
> describe what's in it.
Done

Line 290: * Builds a predicate to evaluate statistical values by combining 'inputSlot',
> make 'statistical values' concrete
Done

Line 291: * 'inputPred' and 'op' into a new predicate, and adding it to 'minMaxConjunts_'.
> missing other side effects
Done

Line 297: Preconditions.checkState(constExpr.isConstant());
> why not a literal?
My reasoning was that any constant expression could be evaluated against the
statistics. I added a test to make sure this does indeed work as expected.

Line 303: // Make a new slot from the slot descriptor.
> what do you mean by 'slot'? (slot/slot descriptor are usually synonyms)
I meant SlotRef, but the comment looked superfluous and I removed it
altogether.

Line 315: * TODO: This method currently doesn't de-dup slots that are referenced multiple times.
> resolve
As discussed, let's keep this for now and address it in a subsequent change by
simplifying redundant exprs. I created IMPALA-4958 to track this.

Line 328: // We only support slot refs on the left hand side of the predicate, a rewriting
> this overlaps with what kuduscannode does, and there's already a similar co
I had talked to Alex about this last week and he suggested adding a general
rewrite rule in the FE to normalize binary exprs. I removed the branches from
BinaryPredicate.java that are no longer needed after the normalization. I also
cleaned up code in KuduScanNode.java that is no longer needed.

Line 343: {
> { on preceding line
Done

Line 658: msg.hdfs_scan_node.setMin_max_tuple_id(minMaxTuple_.getId().asInt());
> check that it's not invalid
How can I check this? The tuple id needs to come from the TupleId generator,
and there is no ctor for a TupleDescriptor that leaves it uninitialized. Do
you mean minMaxTuple_.getId() != null?

http://gerrit.cloudera.org:8080/#/c/6032/4/fe/src/main/java/org/apache/impala/rewrite/NormalizeBinaryPredicatesRule.java
File fe/src/main/java/org/apache/impala/rewrite/NormalizeBinaryPredicatesRule.java:

Line 35: public class NormalizeBinaryPredicatesRule implements ExprRewriteRule {
> add tests for rule
Done

Line 40: if (!expr.isAnalyzed()) return expr;
> why do you need this?
I had copied this over; removed it.

http://gerrit.cloudera.org:8080/#/c/6032/4/testdata/workloads/functional-planner/queries/PlannerTest/aggregation.test
File testdata/workloads/functional-planner/queries/PlannerTest/aggregation.test:

Line 1047: statistics predicates: c_nationkey <= 16, c_nationkey >= 16
> i think it's better to retain the original predicate (you don't need to sho
I changed the code to track the original predicates and display them instead.
I also changed the explain level to "extended", based on its description in
the docs (below).
While the statistics here are different from column stats, I think it makes
sense to keep all statistics in the same level. However, I don't feel strongly
about this - let me know if you want me to change it to verbose. On a related
note, do you think we should change "statistics predicates" to something like
"parquet statistics predicates" to make it clearer that this is not column
statistics?

"Includes additional detail about how the query planner uses statistics in its
decision-making process, to understand how a query could be tuned by gathering
statistics, using query hints, adding or removing predicates, and so on."

--
To view, visit http://gerrit.cloudera.org:8080/6032
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker <l...@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.b...@cloudera.com>
Gerrit-Reviewer: Lars Volker <l...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <mar...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <m...@cloudera.com>
Gerrit-Reviewer: Matthew Mulder <mmul...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mmokh...@cloudera.com>
Gerrit-HasComments: Yes