Lars Volker has posted comments on this change.

Change subject: IMPALA-2328: Read support for min/max Parquet statistics
......................................................................
Patch Set 4: (26 comments)

Thanks for the review. Please see PS6.

http://gerrit.cloudera.org:8080/#/c/6032/4/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 472: ScopedBuffer tuple_buffer(scan_node_->mem_tracker());
> no need to allocate this and go through the setup for every single row grou
Done

Line 486: conjuncts_with_stats.reserve(min_max_conjuncts_ctxs_.size());
> there's too much memory allocation going on for what are essentially static
Done

Line 496: unique_ptr<ColumnStatsBase> stats(
> there's no need for memory allocation here
Done

Line 501: if (e->function_name() == "lt" || e->function_name() == "le") {
> introduce named constants (in the appropriate class)
Other code that compares function_name follows the same pattern. As discussed
in person, let's keep this for now and see if we find a nicer abstraction to
determine when to use min/max values, e.g. monotonically bound predicates.

http://gerrit.cloudera.org:8080/#/c/6032/4/be/src/exec/hdfs-parquet-scanner.h
File be/src/exec/hdfs-parquet-scanner.h:

Line 378: /// Min/max statistics contexts.
> who owns them?
Done

http://gerrit.cloudera.org:8080/#/c/6032/4/be/src/exec/hdfs-scan-node-base.cc
File be/src/exec/hdfs-scan-node-base.cc:

Line 182: if (min_max_tuple_id_ > -1) {
> ">" or < don't make sense for tuple ids
Done

http://gerrit.cloudera.org:8080/#/c/6032/4/be/src/exec/parquet-column-stats.h
File be/src/exec/parquet-column-stats.h:

Line 40: /// writing parquet files. It provides an interface to populate a parquet::Statistics
> this description isn't true anymore, it's now also an intermediate staging
I removed the expanded abstraction and added a sentence to the comment
explaining that it can also be used to decode stats.

Line 70: /// 'nullptr' for unsupported types. The caller is responsible for freeing the memory.
> this should be created in an objectpool
Code has been removed.
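The reply to Line 501 keeps the pattern of inspecting `function_name()` ("lt", "le", ...) to decide which statistic a conjunct should be compared against. As a rough illustration of that min/max pruning decision (hypothetical types and helper names, not the patch's actual code), a sketch in C++:

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical stand-in for a row group's decoded column statistics.
struct ColumnStats {
  int64_t min_value;
  int64_t max_value;
};

// Returns true if no row in the group can satisfy a predicate of the form
// <col> OP <constant>, i.e. the whole row group may be skipped.
bool CanSkipRowGroup(const std::string& fn_name, const ColumnStats& stats,
                     int64_t constant) {
  // "lt"/"le": predicate is col < c (or <=); only the group's minimum matters.
  if (fn_name == "lt") return !(stats.min_value < constant);
  if (fn_name == "le") return !(stats.min_value <= constant);
  // "gt"/"ge": predicate is col > c (or >=); only the maximum matters.
  if (fn_name == "gt") return !(stats.max_value > constant);
  if (fn_name == "ge") return !(stats.max_value >= constant);
  // "eq": skip only if the constant lies outside the [min, max] range.
  if (fn_name == "eq") {
    return constant < stats.min_value || constant > stats.max_value;
  }
  // Unsupported predicate: conservatively keep the row group.
  return false;
}
```

Note this is only sound for predicates whose truth is monotonically bounded by the column's min/max, which is why the reply mentions "monotonically bound predicates" as a possible future abstraction.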
Line 87: virtual void WriteMinToBuffer(void* buffer) const = 0;
> remove "tobuffer", that's clear from the parameters. what's unclear is the
Done

Line 126: max_value_(max_value) {
> formatting
Done

http://gerrit.cloudera.org:8080/#/c/6032/4/common/thrift/PlanNodes.thrift
File common/thrift/PlanNodes.thrift:

Line 208: // Conjuncts that can be evaluated against the statistics of parquet files. Each
> refer to thrift structs explicitly, where appropriate.
Done

Line 209: // conjunct references the slot in 'min_max_tuple_id' with the same index.
> i didn't understand the last sentence.
Changed it.

http://gerrit.cloudera.org:8080/#/c/6032/4/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

Line 143: // List of conjuncts for min/max values, that are used to skip data when scanning
> "min/max values" is vague, refer to thrift structs explicitly.
Done

Line 145: private List<Expr> minMaxConjunts_ = Lists.newArrayList();
> conjuncts
Done

Line 147: // Tuple that is used to materialize statistics when scanning Parquet files.
> describe what's in it.
Done

Line 290: * Builds a predicate to evaluate statistical values by combining 'inputSlot',
> make 'statistical values' concrete
Done

Line 291: * 'inputPred' and 'op' into a new predicate, and adding it to 'minMaxConjunts_'.
> missing other side effects
Done

Line 297: Preconditions.checkState(constExpr.isConstant());
> why not a literal?
My reasoning was that any constant expression could be evaluated against the
statistics. I added a test to make sure this does indeed work as expected.

Line 303: // Make a new slot from the slot descriptor.
> what do you mean by 'slot'? (slot/slot descriptor are usually synonyms)
I meant SlotRef, but the comment looked superfluous and I removed it
altogether.

Line 315: * TODO: This method currently doesn't de-dup slots that are referenced multiple times.
> resolve
As discussed, let's keep this for now and address it in a subsequent change by
simplifying redundant exprs. I created IMPALA-4958 to track this.

Line 328: // We only support slot refs on the left hand side of the predicate, a rewriting
> this overlaps with what kuduscannode does, and there's already a similar co
I had talked to Alex about this last week and he suggested adding a general
rewrite rule in the FE to normalize binary exprs. I removed the branches from
BinaryPredicate.java that are no longer needed after the normalization. I also
cleaned up code in KuduScanNode.java that is no longer needed.

Line 343: {
> { on preceding line
Done

Line 658: msg.hdfs_scan_node.setMin_max_tuple_id(minMaxTuple_.getId().asInt());
> check that it's not invalid
How can I check this? The tuple id needs to come from the TupleId generator,
and there is no ctor for a TupleDescriptor that leaves it uninitialized. Do
you mean minMaxTuple_.getId() != null?

http://gerrit.cloudera.org:8080/#/c/6032/4/fe/src/main/java/org/apache/impala/rewrite/NormalizeBinaryPredicatesRule.java
File fe/src/main/java/org/apache/impala/rewrite/NormalizeBinaryPredicatesRule.java:

Line 35: public class NormalizeBinaryPredicatesRule implements ExprRewriteRule {
> add tests for rule
Done

Line 40: if (!expr.isAnalyzed()) return expr;
> why do you need this?
I had copied this over; removed it.

http://gerrit.cloudera.org:8080/#/c/6032/4/testdata/workloads/functional-planner/queries/PlannerTest/aggregation.test
File testdata/workloads/functional-planner/queries/PlannerTest/aggregation.test:

Line 1047: statistics predicates: c_nationkey <= 16, c_nationkey >= 16
> i think it's better to retain the original predicate (you don't need to sho
I changed the code to track the original predicates and display them instead.
I also changed the explain level to "extended", based on its description in
the docs (below).
While the statistics here are different from column stats, I think it makes
sense to keep all statistics in the same level. However, I don't feel strongly
about this - let me know if you want me to change it to verbose. On a related
note, do you think we should change "statistics predicates" to something like
"parquet statistics predicates" to make it clearer that this is not column
statistics?

"Includes additional detail about how the query planner uses statistics in its
decision-making process, to understand how a query could be tuned by gathering
statistics, using query hints, adding or removing predicates, and so on."

--
To view, visit http://gerrit.cloudera.org:8080/6032
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker <l...@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.b...@cloudera.com>
Gerrit-Reviewer: Lars Volker <l...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <mar...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <m...@cloudera.com>
Gerrit-Reviewer: Matthew Mulder <mmul...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mmokh...@cloudera.com>
Gerrit-HasComments: Yes