Jinfeng Ni created DRILL-684:
--------------------------------
Summary: Use parquet row count in cost-based optimization. Use
parquet row count, column value count to optimize count() aggregate function.
Key: DRILL-684
URL: https://issues.apache.org/jira/browse/DRILL-684
Project: Apache Drill
Issue Type: Improvement
Reporter: Jinfeng Ni
Assignee: Jinfeng Ni
Attachments: DRILL-684.1.patch
Parquet group scan provides the exact row count and the exact value count for
each individual column. Such information could be leveraged in the following
two ways:
1. Use the row count in cost estimation when a query refers to parquet files.
2. Use the row count or column value count to optimize count() aggregate
function.
For instance:
  select count(*) from parquet_file;
  select count(column_a) from parquet_file;
The first query could be transformed to return the row count directly, and the
second could return the column value count for 'column_a'. Both cases avoid
scanning the whole parquet files, which improves query performance.
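
The idea can be sketched in a few lines of Python. This is only an
illustration, not the code in the attached patch: FileStats,
total_row_count and count_from_metadata are hypothetical names, and the
sketch assumes the per-column value count is the number of non-null
values, which is what count(column) has to return.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class FileStats:
    """Hypothetical per-file statistics, as a group scan might expose them."""
    row_count: int
    # Assumption: number of non-null values for each column in the file.
    column_value_counts: Dict[str, int]


def total_row_count(files: List[FileStats]) -> int:
    """Exact row count over all files; usable as the scan's row estimate."""
    return sum(f.row_count for f in files)


def count_from_metadata(files: List[FileStats],
                        column: Optional[str] = None) -> Optional[int]:
    """Answer count(*) (column=None) or count(column) from metadata only.

    Returns None when a file has no value count for the column, in which
    case the caller has to fall back to a regular scan.
    """
    if column is None:
        return total_row_count(files)
    total = 0
    for f in files:
        if column not in f.column_value_counts:
            return None
        total += f.column_value_counts[column]
    return total
```

With such statistics available, count(*) maps to
count_from_metadata(files) and count(column_a) maps to
count_from_metadata(files, "column_a"); the same total_row_count value
could feed the cost estimate in item 1.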