[ https://issues.apache.org/jira/browse/DRILL-4387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15152837#comment-15152837 ]
ASF GitHub Bot commented on DRILL-4387:
---------------------------------------

Github user jinfengni commented on a diff in the pull request:

    https://github.com/apache/drill/pull/379#discussion_r53361870

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetScanBatchCreator.java ---
    @@ -87,9 +87,6 @@ public ScanBatch getBatch(FragmentContext context, ParquetRowGroupScan rowGroupS
               newColumns.add(column);
             }
           }
    -      if (newColumns.isEmpty()) {
    --- End diff --

    I went through all the ScanBatchCreators in Drill's code base. ParquetScanBatchCreator seems to be the only one that converts an empty column list to ALL_COLUMNS. Looking at the history, DRILL-1845 added this code, probably just to make skipAll queries work for parquet. With the patch for DRILL-4279, the parquet record reader is able to handle an empty column list.

    Besides ParquetScanBatchCreator, this patch also modifies HBaseGroupScan and EasyGroupScan, which originally interpreted an empty column list as ALL_COLUMNS.

    I'll add some comments to the code to clarify the different meanings of a NULL and an empty column list.

> Improve execution side when it handles skipAll query
> ----------------------------------------------------
>
>                 Key: DRILL-4387
>                 URL: https://issues.apache.org/jira/browse/DRILL-4387
>             Project: Apache Drill
>          Issue Type: Bug
>            Reporter: Jinfeng Ni
>            Assignee: Jinfeng Ni
>             Fix For: 1.6.0
>
>
> DRILL-4279 changes the planner side and the RecordReader in the execution
> side when they handle skipAll queries. However, there are other places in
> the codebase that do not handle skipAll queries efficiently. In particular,
> in GroupScan or ScanBatchCreator, we replace a NULL or empty column list
> with the star column. This essentially forces the execution side
> (RecordReader) to fetch all the columns from the data source. Such behavior
> leads to a big performance overhead for the SCAN operator.
> To improve Drill's performance, we should change those places as well, as a
> follow-up to DRILL-4279.
> One simple example of this problem is:
> {code}
> SELECT DISTINCT substring(dir1, 5) FROM dfs.`/Path/To/ParquetTable`;
> {code}
> The query does not require any regular column from the parquet file. However,
> ParquetRowGroupScan and ParquetScanBatchCreator will put the star column in
> the column list. If the table has dozens or hundreds of columns, this makes
> the SCAN operator much more expensive than necessary.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
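The NULL-vs-empty distinction discussed above can be sketched in isolation. This is a hypothetical helper, not Drill's actual API: it only illustrates the convention the comment describes, where a NULL column list means "project all columns" (star) while an empty list means "project no regular columns" (a skipAll query, e.g. one that only touches partition columns like dir1).

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of the column-list convention (names are
// illustrative, not Drill's real classes or constants).
public class ColumnListSemantics {
  // Stand-in for Drill's ALL_COLUMNS / star column.
  public static final String STAR = "*";

  /** Decide what the scan should materialize for a given projection. */
  public static List<String> columnsToScan(List<String> requested) {
    if (requested == null) {
      // NULL: caller specified no projection -> read everything.
      return Collections.singletonList(STAR);
    }
    // An empty list stays empty: the record reader fetches no regular
    // columns, instead of being forced up to ALL_COLUMNS.
    return requested;
  }

  public static void main(String[] args) {
    System.out.println(columnsToScan(null));                  // [*]
    System.out.println(columnsToScan(Collections.emptyList())); // []
    System.out.println(columnsToScan(Arrays.asList("a", "b"))); // [a, b]
  }
}
```

Under this convention, only a genuinely unspecified projection (NULL) is widened to the star column; a deliberately empty projection is passed through, which is what lets the SCAN operator skip reading any regular columns for skipAll queries.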