[Impala-ASF-CR] IMPALA-5036: Parquet count star optimization

Alex Behm (Code Review) Wed, 07 Jun 2017 18:27:07 -0700

Alex Behm has posted comments on this change.

Change subject: IMPALA-5036: Parquet count star optimization
......................................................................



Patch Set 2:

(23 comments)

Code comments. Still going through the tests.

http://gerrit.cloudera.org:8080/#/c/6812/1/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 432:       DCHECK(row_group_idx_ <= file_metadata_.row_groups.size());
> The file has to have more than 1 row group. I'm pretty sure tpch_parquet.li
Makes sense, batch_size=1 then


Line 455:     int max_tuples = min<int64_t>(row_batch->capacity(), 
rows_remaining);
> Not sure, should we be checking that stats is non-negative? What do you sug
I agree with Mostafa in principle and I believe we have a JIRA to track this. 
However, this patch only optimizes existing behavior, it does not change it. 
Even before this patch we relied on the same num_rows for count star queries, 
so I think we should tackle Mostafa's concern separately.


Line 488:       eos_ = true;
> To keep it consistent with line 434. I don't like checking for greater or e
Ok, DCHECK_LE() then


Line 1455:     // Skip partition columns
> There can be several slots if we are grouping by a partition column. Left u
Mostafa, we need the per-column null count to optimize queries like select 
count(l_comment) from lineitem. Good idea to do that, but not in this patch. 
Pooja is working on populating the null count.


http://gerrit.cloudera.org:8080/#/c/6812/2/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 428:     RETURN_IF_ERROR(
In most cases we're allocating way too much here. We can get a more accurate 
capacity by taking the minimum of the batch size and the number of row groups 
in this file. There's a static ResizeAndAllocateTupleBuffer() in RowBatch that 
can be used for this purpose.


Line 437:       int64_t* dst_slot = 
reinterpret_cast<int64_t*>(dst_tuple->GetSlot(0));
The dst_tuple->GetSlot(0) looks wrong. There's no guarantee that the target 
slot is always at that offset. I think we may need to plumb through the slot id 
from the FE and then get the slot offset from the slot descriptor.


http://gerrit.cloudera.org:8080/#/c/6812/2/fe/src/main/java/org/apache/impala/analysis/AggregateInfo.java
File fe/src/main/java/org/apache/impala/analysis/AggregateInfo.java:

Line 333:       FunctionCallExpr f = (FunctionCallExpr) substitutedAgg;
remove


Line 359:     //Preconditions.checkState(mergeAggInfo_ == null);
?


Line 659:     if (getMaterializedAggregateExprs().size() != 1) return false;
Avoid calling getMaterializedAggregateExprs(() twice


http://gerrit.cloudera.org:8080/#/c/6812/2/fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
File fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java:

Line 571:   public FunctionCallExpr getMergeAggInputFn() { return 
mergeAggInputFn_; }
Do we need these getters/setters?


Line 621:       if (!(substitutedFn instanceof FunctionCallExpr)) return e;
This looks like an error we should not ignore. Convert into Preconditions check.


http://gerrit.cloudera.org:8080/#/c/6812/2/fe/src/main/java/org/apache/impala/analysis/TupleDescriptor.java
File fe/src/main/java/org/apache/impala/analysis/TupleDescriptor.java:

Line 334:     ArrayList<SlotDescriptor> materializedSlots = 
getMaterializedSlots();
The previous implementation seemed cheaper, I'd prefer to leave it. Using 
hdfsTable.isClusteringColumn() is still better.


http://gerrit.cloudera.org:8080/#/c/6812/2/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

Line 105:  * TODO: pass in range restrictions.
Might be good to add a word or two about the agg optimization flow, i.e., the 
caller passes in some info about the agg in this query block, then this scan 
decides whether whether to apply the optimization or not, then the produced 
smap must be applied to the agg infos from this query block.


Line 134:   // Set to true if the count(*) aggregation can be optimized by 
populating the tuple with
Set to true if the query block of this scan has a count(*) aggregation that is 
amenable to optimization by populating the scan tuple with Parquet metadata. 
This scan does additional analysis in init() to determine whether it is correct 
to apply the optimization.


Line 238:     // Create two functions that we will put into an smap. We want to 
replace the
Integrate this into a function comment. It should describe what this function 
does, i.e. it adds a new slot descriptor and it populates the 
optimizedAggSmap_. Describe what the entry looks like.


Line 247:     sd.setType(Type.BIGINT);
set slot as non-nullable


Line 253:     sumFn.analyze(analyzer);
use analyzeNoThrow() and remove the throws declaration


Line 288:           (getTupleDesc().getMaterializedSlots().isEmpty() ||
use desc_ instead of getTupleDesc()


http://gerrit.cloudera.org:8080/#/c/6812/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java:

Line 1207:    * If 'hdfsTblRef' only contains partition columns and 
'fastPartitionKeyScans'
comment on new param


Line 1268:    * 'fastPartitionKeyScans' indicates whether to try to produce the 
slots with
comment new param


Line 1512:    * 'fastPartitionKeyScans' indicates whether to try to produce the 
slots with
comment new param


http://gerrit.cloudera.org:8080/#/c/6812/2/fe/src/test/java/org/apache/impala/analysis/ExprRewriteRulesTest.java
File fe/src/test/java/org/apache/impala/analysis/ExprRewriteRulesTest.java:

Line 378:     RewritesOk("count(id)", rule, null);
Also check count(1+1) and count(1+NULL)


http://gerrit.cloudera.org:8080/#/c/6812/1/testdata/workloads/functional-planner/queries/PlannerTest/parquet-stats-agg.test
File 
testdata/workloads/functional-planner/queries/PlannerTest/parquet-stats-agg.test:

Line 22: |  |  output: sum_init_zero(functional_parquet.alltypes.parquet-stats: 
num_rows)
> Not quite sure how to do that. We set the SlotDescriptor label and then it 
Looks like you fixed, thanks!


-- 
To view, visit http://gerrit.cloudera.org:8080/6812
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I536b85c014821296aed68a0c68faadae96005e62
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky <tbobrovyt...@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.b...@cloudera.com>
Gerrit-Reviewer: Lars Volker <l...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <mar...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mmokh...@cloudera.com>
Gerrit-Reviewer: Taras Bobrovytsky <tbobrovyt...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5036: Parquet count star optimization

Reply via email to