Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/18327 )
Change subject: IMPALA-11123: Optimize count(star) for ORC scans ...................................................................... Patch Set 9: (10 comments) http://gerrit.cloudera.org:8080/#/c/18327/8//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/18327/8//COMMIT_MSG@9 PS8, Line 9: This patch provides count(star) optimization for ORC scans, similar to : the work done in IMPALA-5036 fo > nit. This patch provides count(star) optimization for ORC scans, similar to Done http://gerrit.cloudera.org:8080/#/c/18327/8//COMMIT_MSG@23 PS8, Line 23: into HdfsColumnarScanner. > can you add some benchmark results? I think that the difference should be v Done http://gerrit.cloudera.org:8080/#/c/18327/8/be/src/exec/hdfs-orc-scanner.cc File be/src/exec/hdfs-orc-scanner.cc: http://gerrit.cloudera.org:8080/#/c/18327/8/be/src/exec/hdfs-orc-scanner.cc@807 PS8, Line 807: : // Process next stripe if current stripe is drained. Each stripe will generate several : // orc batches. We only advance the stripe after processing the last batch. : // 'advance_stripe_' is updated in 'NextStripe', meaning the current stripe we advance : // to can be skip. 'end_of_stripe_' marks whether current stripe is drained. It's only : // set to true in 'AssembleRows'. : while (advance_group_ || end_of_stripe_) { : // The next stripe will use a new dictionary blob so transfer the memory to row_batch. : row_batch->tuple_data_pool()->AcquireData(dictionary_pool_.get(), false); : context_->ReleaseCompletedResources(/* done */ true); : // Commit the rows to flush the row batch from the previous stripe. : RETURN_IF_ERROR(CommitRows(0, row_batch)); : : RETURN_IF_ERROR(NextStripe()); : DCHECK_LE(group_idx_, reader_->getNumberOfStripes()); : if (group_idx_ == reader_->getNumberOfStripes()) { : eos_ = true; : DCHECK(parse_status_.ok()); : return Status::OK(); : } : } : : // Apply any runtime filters to static tuples containing the partition keys for this : // partition. If any filter fails, we return immediately and stop processing this : / > It would be also nice to unify this with the Parquet version - my understan Done. I decide to name it 'rows_read_in_group_'. http://gerrit.cloudera.org:8080/#/c/18327/8/testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test File testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test: http://gerrit.cloudera.org:8080/#/c/18327/8/testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test@3 PS8, Line 3: # a text table, so the optimization is not applied. The optimization is observed when > nit. May add a comment: Done http://gerrit.cloudera.org:8080/#/c/18327/8/testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test@126 PS8, Line 126: > nit. is Done http://gerrit.cloudera.org:8080/#/c/18327/8/testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test@168 PS8, Line 168: > nit. it can not be applied to the 1st aggregate function. Done http://gerrit.cloudera.org:8080/#/c/18327/8/testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test@181 PS8, Line 181: 07K : ==== > nit. The optimization does apply to the inner count(*). I think it is genuine that count star optimization is not applied for for both 'count(*)' in this case. The cardinality=24 of "00:SCAN HDFS" is coincidence from other optimization and does not come from this patch. I confirmed this by running this query in build without this patch. http://gerrit.cloudera.org:8080/#/c/18327/8/testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test@277 PS8, Line 277: > nit. all predicates are on partition columns only. Done http://gerrit.cloudera.org:8080/#/c/18327/8/testdata/workloads/functional-planner/queries/PlannerTest/orc-stats-agg.test@386 PS8, Line 386: row-size=0B cardinality=13.07K > nit. in general. Done http://gerrit.cloudera.org:8080/#/c/18327/8/testdata/workloads/functional-query/queries/QueryTest/orc-stats-agg.test File testdata/workloads/functional-query/queries/QueryTest/orc-stats-agg.test: http://gerrit.cloudera.org:8080/#/c/18327/8/testdata/workloads/functional-query/queries/QueryTest/orc-stats-agg.test@9 PS8, Line 9: bigint > Don't we have a similar counter as for Parquet in the profile? Done. Added profile counter verification. -- To view, visit http://gerrit.cloudera.org:8080/18327 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091 Gerrit-Change-Number: 18327 Gerrit-PatchSet: 9 Gerrit-Owner: Riza Suminto <[email protected]> Gerrit-Reviewer: Csaba Ringhofer <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Qifan Chen <[email protected]> Gerrit-Reviewer: Quanlong Huang <[email protected]> Gerrit-Reviewer: Riza Suminto <[email protected]> Gerrit-Comment-Date: Thu, 31 Mar 2022 18:53:08 +0000 Gerrit-HasComments: Yes
