[ https://issues.apache.org/jira/browse/IMPALA-5036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alexander Behm updated IMPALA-5036: ----------------------------------- Description: {code} select count(*) from parquet_table; select count(*) from parquet_table group by partition_col; {code} Impala already has a special code path for fast Parquet scans when no columns are scanned and materialized, but the performance can be significantly improved with a plan+execution change, as follows: *Execution change* Instead of returning empty batches until num_rows have been returned, the Parquet scanner can populate a single slot with the num_rows from the Parquet row groups *Plan change* The count(*) local aggregation needs to be changed to a sum(num_rows_slot) aggregation. The final distributed plan will be: scan -> local agg with sum(num_rows_slot) -> merge agg sum(sum(num_rows_slot)) This optimization is applicable where is only a count(*) and there are no scan predicates. was: {code} select count(*) from parquet_table; select count(*) from parquet_table group by partition_col; {code} Impala already has a special code path for fast Parquet scans when no columns are scanned and materialized, but the performance can be significantly improved with a plan+execution change, as follows: Execution change: Instead of returning empty batches until num_rows have been returned, the Parquet scanner can populate a single slot with the num_rows from the Parquet row groups Plan change: The count(*) local aggregation needs to be changed to a sum(num_rows_slot) aggregation. The final distributed plan will be: scan -> local agg with sum(num_rows_slot) -> merge agg sum(sum(num_rows_slot)) This optimization is applicable where there is only count(*) and no scan predicates. > Improve COUNT(*) performance of Parquet scans. > ---------------------------------------------- > > Key: IMPALA-5036 > URL: https://issues.apache.org/jira/browse/IMPALA-5036 > Project: IMPALA > Issue Type: Sub-task > Components: Backend > Affects Versions: Impala 2.5.0, Impala 2.6.0, Impala 2.7.0, Impala 2.8.0 > Reporter: Alexander Behm > Labels: parquet, performance > > {code} > select count(*) from parquet_table; > select count(*) from parquet_table group by partition_col; > {code} > Impala already has a special code path for fast Parquet scans when no columns > are scanned and materialized, but the performance can be significantly > improved with a plan+execution change, as follows: > *Execution change* > Instead of returning empty batches until num_rows have been returned, the > Parquet scanner can populate a single slot with the num_rows from the Parquet > row groups > *Plan change* > The count(*) local aggregation needs to be changed to a sum(num_rows_slot) > aggregation. > The final distributed plan will be: > scan -> local agg with sum(num_rows_slot) -> merge agg sum(sum(num_rows_slot)) > This optimization is applicable where is only a count(*) and there are no > scan predicates. -- This message was sent by Atlassian JIRA (v6.3.15#6346)