Riza Suminto has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18327 )

Change subject: IMPALA-11123: Optimize count(star) for ORC scans
......................................................................


Patch Set 9:

(10 comments)

http://gerrit.cloudera.org:8080/#/c/18327/8//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/18327/8//COMMIT_MSG@9
PS8, Line 9: This patch provides count(star) optimization for ORC scans, 
similar to
           : the work done in IMPALA-5036 fo
> nit. This patch provides count(star) optimization for ORC scans, similar to
Done


http://gerrit.cloudera.org:8080/#/c/18327/8//COMMIT_MSG@23
PS8, Line 23: into HdfsColumnarScanner.
> can you add some benchmark results? I think that the difference should be v
Done


http://gerrit.cloudera.org:8080/#/c/18327/8/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/18327/8/be/src/exec/hdfs-orc-scanner.cc@807
PS8, Line 807:
             :   // Process next stripe if current stripe is drained. Each 
stripe will generate several
             :   // orc batches. We only advance the stripe after processing 
the last batch.
             :   // 'advance_stripe_' is updated in 'NextStripe', meaning the 
current stripe we advance
             :   // to can be skip. 'end_of_stripe_' marks whether current 
stripe is drained. It's only
             :   // set to true in 'AssembleRows'.
             :   while (advance_group_ || end_of_stripe_) {
             :     // The next stripe will use a new dictionary blob so 
transfer the memory to row_batch.
             :     
row_batch->tuple_data_pool()->AcquireData(dictionary_pool_.get(), false);
             :     context_->ReleaseCompletedResources(/* done */ true);
             :     // Commit the rows to flush the row batch from the previous 
stripe.
             :     RETURN_IF_ERROR(CommitRows(0, row_batch));
             :
             :     RETURN_IF_ERROR(NextStripe());
             :     DCHECK_LE(group_idx_, reader_->getNumberOfStripes());
             :     if (group_idx_ == reader_->getNumberOfStripes()) {
             :       eos_ = true;
             :       DCHECK(parse_status_.ok());
             :       return Status::OK();
             :     }
             :   }
             :
             :   // Apply any runtime filters to static tuples containing the 
partition keys for this
             :   // partition. If any filter fails, we return immediately and 
stop processing this
             :   /
> It would be also nice to unify this with the Parquet version - my understan
Done. I decide to name it 'rows_read_in_group_'.


http://gerrit.cloudera.org:8080/#/c/18327/8/testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test
File 
testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test:

http://gerrit.cloudera.org:8080/#/c/18327/8/testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test@3
PS8, Line 3: # a text table, so the optimization is not applied. The 
optimization is observed when
> nit. May add a comment:
Done


http://gerrit.cloudera.org:8080/#/c/18327/8/testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test@126
PS8, Line 126:
> nit. is
Done


http://gerrit.cloudera.org:8080/#/c/18327/8/testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test@168
PS8, Line 168: 
> nit. it can not be applied to the 1st aggregate function.
Done


http://gerrit.cloudera.org:8080/#/c/18327/8/testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test@181
PS8, Line 181: 07K
             : ====
> nit. The optimization does apply to the inner count(*).
I think it is genuine that count star optimization is not applied for for both 
'count(*)' in this case.
The cardinality=24 of "00:SCAN HDFS" is coincidence from other optimization and 
does not come from this patch.
I confirmed this by running this query in build without this patch.


http://gerrit.cloudera.org:8080/#/c/18327/8/testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test@277
PS8, Line 277:
> nit. all predicates are on partition columns only.
Done


http://gerrit.cloudera.org:8080/#/c/18327/8/testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test@386
PS8, Line 386:    row-size=0B cardinality=13.07K
> nit. in general.
Done


http://gerrit.cloudera.org:8080/#/c/18327/8/testdata/workloads/functional-query/queries/QueryTest/orc-stats-agg.test
File testdata/workloads/functional-query/queries/QueryTest/orc-stats-agg.test:

http://gerrit.cloudera.org:8080/#/c/18327/8/testdata/workloads/functional-query/queries/QueryTest/orc-stats-agg.test@9
PS8, Line 9: bigint
> Don't we have a similar counter as for Parquet in the profile?
Done. Added profile counter verification.



--
To view, visit http://gerrit.cloudera.org:8080/18327
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Gerrit-Change-Number: 18327
Gerrit-PatchSet: 9
Gerrit-Owner: Riza Suminto <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Qifan Chen <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Riza Suminto <[email protected]>
Gerrit-Comment-Date: Thu, 31 Mar 2022 18:53:08 +0000
Gerrit-HasComments: Yes

Reply via email to