Alex Behm has posted comments on this change. Change subject: IMPALA-5036: Parquet count star optimization ......................................................................
Patch Set 2: (23 comments) Code comments. Still going through the tests. http://gerrit.cloudera.org:8080/#/c/6812/1/be/src/exec/hdfs-parquet-scanner.cc File be/src/exec/hdfs-parquet-scanner.cc: Line 432: DCHECK(row_group_idx_ <= file_metadata_.row_groups.size()); > The file has to have more than 1 row group. I'm pretty sure tpch_parquet.li Makes sense, batch_size=1 then Line 455: int max_tuples = min<int64_t>(row_batch->capacity(), rows_remaining); > Not sure, should we be checking that stats is non-negative? What do you sug I agree with Mostafa in principle and I believe we have a JIRA to track this. However, this patch only optimizes existing behavior, it does not change it. Even before this patch we relied on the same num_rows for count star queries, so I think we should tackle Mostafa's concern separately. Line 488: eos_ = true; > To keep it consistent with line 434. I don't like checking for greater or e Ok, DCHECK_LE() then Line 1455: // Skip partition columns > There can be several slots if we are grouping by a partition column. Left u Mostafa, we need the per-column null count to optimize queries like select count(l_comment) from lineitem. Good idea to do that, but not in this patch. Pooja is working on populating the null count. http://gerrit.cloudera.org:8080/#/c/6812/2/be/src/exec/hdfs-parquet-scanner.cc File be/src/exec/hdfs-parquet-scanner.cc: Line 428: RETURN_IF_ERROR( In most cases we're allocating way too much here. We can get a more accurate capacity by taking the minimum of the batch size and the number of row groups in this file. There's a static ResizeAndAllocateTupleBuffer() in RowBatch that can be used for this purpose. Line 437: int64_t* dst_slot = reinterpret_cast<int64_t*>(dst_tuple->GetSlot(0)); The dst_tuple->GetSlot(0) looks wrong. There's no guarantee that the target slot is always at that offset. I think we may need to plumb through the slot id from the FE and then get the slot offset from the slot descriptor. http://gerrit.cloudera.org:8080/#/c/6812/2/fe/src/main/java/org/apache/impala/analysis/AggregateInfo.java File fe/src/main/java/org/apache/impala/analysis/AggregateInfo.java: Line 333: FunctionCallExpr f = (FunctionCallExpr) substitutedAgg; remove Line 359: //Preconditions.checkState(mergeAggInfo_ == null); ? Line 659: if (getMaterializedAggregateExprs().size() != 1) return false; Avoid calling getMaterializedAggregateExprs(() twice http://gerrit.cloudera.org:8080/#/c/6812/2/fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java File fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java: Line 571: public FunctionCallExpr getMergeAggInputFn() { return mergeAggInputFn_; } Do we need these getters/setters? Line 621: if (!(substitutedFn instanceof FunctionCallExpr)) return e; This looks like an error we should not ignore. Convert into Preconditions check. http://gerrit.cloudera.org:8080/#/c/6812/2/fe/src/main/java/org/apache/impala/analysis/TupleDescriptor.java File fe/src/main/java/org/apache/impala/analysis/TupleDescriptor.java: Line 334: ArrayList<SlotDescriptor> materializedSlots = getMaterializedSlots(); The previous implementation seemed cheaper, I'd prefer to leave it. Using hdfsTable.isClusteringColumn() is still better. http://gerrit.cloudera.org:8080/#/c/6812/2/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: Line 105: * TODO: pass in range restrictions. Might be good to add a word or two about the agg optimization flow, i.e., the caller passes in some info about the agg in this query block, then this scan decides whether whether to apply the optimization or not, then the produced smap must be applied to the agg infos from this query block. Line 134: // Set to true if the count(*) aggregation can be optimized by populating the tuple with Set to true if the query block of this scan has a count(*) aggregation that is amenable to optimization by populating the scan tuple with Parquet metadata. This scan does additional analysis in init() to determine whether it is correct to apply the optimization. Line 238: // Create two functions that we will put into an smap. We want to replace the Integrate this into a function comment. It should describe what this function does, i.e. it adds a new slot descriptor and it populates the optimizedAggSmap_. Describe what the entry looks like. Line 247: sd.setType(Type.BIGINT); set slot as non-nullable Line 253: sumFn.analyze(analyzer); use analyzeNoThrow() and remove the throws declaration Line 288: (getTupleDesc().getMaterializedSlots().isEmpty() || use desc_ instead of getTupleDesc() http://gerrit.cloudera.org:8080/#/c/6812/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java: Line 1207: * If 'hdfsTblRef' only contains partition columns and 'fastPartitionKeyScans' comment on new param Line 1268: * 'fastPartitionKeyScans' indicates whether to try to produce the slots with comment new param Line 1512: * 'fastPartitionKeyScans' indicates whether to try to produce the slots with comment new param http://gerrit.cloudera.org:8080/#/c/6812/2/fe/src/test/java/org/apache/impala/analysis/ExprRewriteRulesTest.java File fe/src/test/java/org/apache/impala/analysis/ExprRewriteRulesTest.java: Line 378: RewritesOk("count(id)", rule, null); Also check count(1+1) and count(1+NULL) http://gerrit.cloudera.org:8080/#/c/6812/1/testdata/workloads/functional-planner/queries/PlannerTest/parquet-stats-agg.test File testdata/workloads/functional-planner/queries/PlannerTest/parquet-stats-agg.test: Line 22: | | output: sum_init_zero(functional_parquet.alltypes.parquet-stats: num_rows) > Not quite sure how to do that. We set the SlotDescriptor label and then it Looks like you fixed, thanks! -- To view, visit http://gerrit.cloudera.org:8080/6812 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I536b85c014821296aed68a0c68faadae96005e62 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras Bobrovytsky <tbobrovyt...@cloudera.com> Gerrit-Reviewer: Alex Behm <alex.b...@cloudera.com> Gerrit-Reviewer: Lars Volker <l...@cloudera.com> Gerrit-Reviewer: Marcel Kornacker <mar...@cloudera.com> Gerrit-Reviewer: Mostafa Mokhtar <mmokh...@cloudera.com> Gerrit-Reviewer: Taras Bobrovytsky <tbobrovyt...@cloudera.com> Gerrit-HasComments: Yes