[ https://issues.apache.org/jira/browse/DRILL-4387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15152837#comment-15152837 ]

ASF GitHub Bot commented on DRILL-4387:
---------------------------------------

Github user jinfengni commented on a diff in the pull request:

    https://github.com/apache/drill/pull/379#discussion_r53361870
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetScanBatchCreator.java ---
    @@ -87,9 +87,6 @@ public ScanBatch getBatch(FragmentContext context, ParquetRowGroupScan rowGroupS
               newColumns.add(column);
             }
           }
    -      if (newColumns.isEmpty()) {
    --- End diff ---
    
    I went through all the ScanBatchCreators in Drill's code base. ParquetScanBatchCreator seems to be the only one that converts an empty column list to ALL_COLUMNS. Looking at the history, DRILL-1845 appears to have added that code, probably just to make skipAll queries work for parquet.
    
    With the patch for DRILL-4279, the parquet record reader is able to handle an empty column list.
    
    Besides ParquetScanBatchCreator, this patch also modifies HBaseGroupScan and EasyGroupScan, which originally interpreted an empty column list as ALL_COLUMNS.
    
    I'll add comments to the code to clarify the different meanings of a NULL column list and an empty column list.
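
A minimal sketch of that convention (hypothetical names, not the actual Drill classes): a NULL column list means a star query that projects every column, while an empty list means a skipAll query that projects no regular columns.

{code}
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not the actual Drill code: illustrates the two
// distinct meanings a column list can carry.
class ColumnListConvention {

  // NULL column list: the query wants every column (star / ALL_COLUMNS).
  static boolean isStarQuery(List<String> columns) {
    return columns == null;
  }

  // Empty column list: a skipAll query; no regular column needs to be read.
  static boolean isSkipAllQuery(List<String> columns) {
    return columns != null && columns.isEmpty();
  }

  public static void main(String[] args) {
    System.out.println(isStarQuery(null));                          // true
    System.out.println(isSkipAllQuery(Collections.emptyList()));    // true
    System.out.println(isSkipAllQuery(Arrays.asList("a", "b")));    // false
  }
}
{code}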



> Improve execution side when it handles skipAll query
> ----------------------------------------------------
>
>                 Key: DRILL-4387
>                 URL: https://issues.apache.org/jira/browse/DRILL-4387
>             Project: Apache Drill
>          Issue Type: Bug
>            Reporter: Jinfeng Ni
>            Assignee: Jinfeng Ni
>             Fix For: 1.6.0
>
>
> DRILL-4279 changes the planner side and the RecordReader on the execution 
> side when they handle a skipAll query. However, there are other places in 
> the codebase that do not handle a skipAll query efficiently. In particular, 
> in GroupScan or ScanBatchCreator, we replace a NULL or empty column list 
> with the star column. This essentially forces the execution side 
> (RecordReader) to fetch all the columns from the data source, which leads to 
> a big performance overhead for the SCAN operator.
> To improve Drill's performance, we should change those places as well, as 
> follow-up work to DRILL-4279.
> One simple example of this problem is:
> {code}
>    SELECT DISTINCT substring(dir1, 5) from dfs.`/Path/To/ParquetTable`;
> {code}
> The query does not require any regular column from the parquet file. However, 
> ParquetRowGroupScan and ParquetScanBatchCreator will put the star column in 
> the column list. If the table has dozens or hundreds of columns, this makes 
> the SCAN operator much more expensive than necessary.
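
As a rough sketch of the substitution described above (illustrative names, not the actual GroupScan/ScanBatchCreator code), the follow-up work amounts to no longer widening an empty column list into the star column:

{code}
import java.util.Collections;
import java.util.List;

// Illustrative only, not the actual Drill code.
class ColumnSubstitution {

  static final List<String> ALL_COLUMNS = Collections.singletonList("*");

  // Old behavior: both NULL and empty lists are widened to the star column,
  // so a skipAll query still reads every column of the table.
  static List<String> oldBehavior(List<String> columns) {
    return (columns == null || columns.isEmpty()) ? ALL_COLUMNS : columns;
  }

  // Behavior after the follow-up fix: only a NULL list falls back to star;
  // an empty list is passed through so the record reader can skip all
  // regular columns.
  static List<String> newBehavior(List<String> columns) {
    return (columns == null) ? ALL_COLUMNS : columns;
  }

  public static void main(String[] args) {
    List<String> skipAll = Collections.emptyList();
    System.out.println(oldBehavior(skipAll)); // [*] -> scans every column
    System.out.println(newBehavior(skipAll)); // []  -> scans no regular column
  }
}
{code}

Under the old behavior the skipAll example above still scans every parquet column; under the new behavior the reader skips all regular columns.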



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
