Jinfeng Ni created DRILL-5773:
---------------------------------
Summary: Project pushdown into a subquery with select *
Key: DRILL-5773
URL: https://issues.apache.org/jira/browse/DRILL-5773
Project: Apache Drill
Issue Type: Improvement
Reporter: Jinfeng Ni
If a subquery / table expression / view uses `select *` and the outer query
requests only a subset of columns/fields, Drill currently does not push the
project down into the subquery. As a result, the scan operator returns every
column/field in the table, which can significantly degrade query performance,
especially when the number of columns/fields is large.
For instance,
{code}
SELECT n_regionkey, count(*) AS cnt
FROM (SELECT * FROM cp.`tpch/nation.parquet`) AS n
GROUP BY n_regionkey;
{code}
Here is the plan:
{code}
00-00    Screen
00-01      Project(n_regionkey=[$0], cnt=[$1])
00-02        Project(n_regionkey=[$0], cnt=[$1])
00-03          HashAgg(group=[{0}], cnt=[COUNT()])
00-04            Project(n_regionkey=[ITEM($0, 'n_regionkey')])
00-05              Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=classpath:/tpch/nation.parquet]], selectionRoot=classpath:/tpch/nation.parquet, numFiles=1, usedMetadataFile=false, columns=[`*`]]])
{code}
Notice that the Scan operator shows columns=[`*`], indicating that it will
read every column.
From a performance perspective, Drill should push the project into a subquery
that uses select *.
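For comparison, here is a sketch of the query rewritten by hand so that the
subquery names only the column the outer query needs. This is a manual
equivalent of what the requested pushdown would achieve, not the planner's
actual output. With this form the Scan operator would be expected to list
only `n_regionkey` in its columns rather than `*`.
{code}
-- Hand-written equivalent: the subquery projects only n_regionkey,
-- so the scan does not need to read the other columns of the table.
SELECT n_regionkey, count(*) AS cnt
FROM (SELECT n_regionkey FROM cp.`tpch/nation.parquet`) AS n
GROUP BY n_regionkey;
{code}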