[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573622#comment-15573622 ]

ASF GitHub Bot commented on DRILL-4905:
---------------------------------------

Github user jinfengni commented on a diff in the pull request:

    https://github.com/apache/drill/pull/597#discussion_r83339590
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java ---
    @@ -913,19 +928,25 @@ public GroupScan applyLimit(long maxRecords) {
         long count = 0;
         int index = 0;
         for (RowGroupInfo rowGroupInfo : rowGroupInfos) {
    -      if (count < maxRecords) {
    -        count += rowGroupInfo.getRowCount();
    +      long rowCount = rowGroupInfo.getRowCount();
    --- End diff ---
    
    The list rowGroupInfos is populated in the init() call when the
    ParquetGroupScan is created. Here, when DrillPushLimitIntoScanRule is
    fired for the first time, if we reduce the parquet files and come to
    line 959, we re-populate the rowGroupInfos list.
    
    The reason your code works as expected is that
    DrillPushLimitIntoScanRule is fired twice. In the second rule
    execution, the file count is not reduced, but the rowGroupInfos list
    is updated in this for-loop block.
    
    However, I think it's not optimal to fire the rule twice. Ideally, we
    should avoid the second firing, since it supposedly does nothing
    (that's a separate issue). We should not write code that relies on the
    assumption that this rule will always be fired twice.
    
    Probably, we should update rowGroupInfos after line 959: once the new
    group scan is created, set its rowGroupInfos directly, as sketched
    below.
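
    For illustration, here is a minimal, self-contained sketch of that
    approach. RowGroupInfo, getRowCount(), and applyLimit() are names
    from the diff above; the stand-in class and selectRowGroups() helper
    are simplifications for this sketch, not Drill's actual classes in
    org.apache.drill.exec.store.parquet:

    import java.util.ArrayList;
    import java.util.List;

    public class LimitPushdownSketch {

        // Simplified stand-in for Drill's RowGroupInfo.
        static class RowGroupInfo {
            final String path;
            final long rowCount;
            RowGroupInfo(String path, long rowCount) {
                this.path = path;
                this.rowCount = rowCount;
            }
            long getRowCount() { return rowCount; }
        }

        // Pick the smallest prefix of row groups whose combined row count
        // covers maxRecords, mirroring the loop under review in applyLimit().
        static List<RowGroupInfo> selectRowGroups(
                List<RowGroupInfo> rowGroupInfos, long maxRecords) {
            List<RowGroupInfo> selected = new ArrayList<>();
            long count = 0;
            for (RowGroupInfo rgi : rowGroupInfos) {
                if (count >= maxRecords) {
                    break;                      // limit already satisfied
                }
                count += rgi.getRowCount();
                selected.add(rgi);
            }
            return selected;
        }

        public static void main(String[] args) {
            List<RowGroupInfo> rowGroupInfos = new ArrayList<>();
            rowGroupInfos.add(new RowGroupInfo("part-0.parquet", 32_000));
            rowGroupInfos.add(new RowGroupInfo("part-1.parquet", 32_000));
            rowGroupInfos.add(new RowGroupInfo("part-2.parquet", 32_000));

            // LIMIT 40000 needs only the first two row groups
            // (32k + 32k >= 40k).
            List<RowGroupInfo> pruned = selectRowGroups(rowGroupInfos, 40_000);
            System.out.println(pruned.size() + " row group(s) selected"); // 2

            // The point of the comment above: after the new (cloned) group
            // scan is created at line 959, assign `pruned` to its
            // rowGroupInfos there, rather than relying on the rule firing
            // a second time.
        }
    }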
    
    
     


> Push down the LIMIT to the parquet reader scan to limit the number of 
> records read
> -----------------------------------------------------------------------------------
>
>                 Key: DRILL-4905
>                 URL: https://issues.apache.org/jira/browse/DRILL-4905
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Parquet
>    Affects Versions: 1.8.0
>            Reporter: Padma Penumarthy
>            Assignee: Padma Penumarthy
>             Fix For: 1.9.0
>
>
> Limit the number of records read from disk by pushing down the limit to 
> the parquet reader.
> For queries like
> select * from <table> limit N; 
> where N < the size of a Parquet row group, we are reading 32K/64K rows or 
> the entire row group. This needs to be optimized to read only N rows.
>  
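
For context, a minimal sketch of the reader-side cap the description asks
for: stop filling batches once N records have been produced instead of
draining the whole row group. The BatchReader interface and its
readBatch(int) method are illustrative assumptions, not Drill's actual
Parquet reader API:

    public class LimitedReadSketch {

        interface BatchReader {
            // Fill up to `max` records into the current batch; return the
            // number actually read, or 0 when the row group is exhausted.
            int readBatch(int max);
        }

        static final int BATCH_SIZE = 4096;

        // Read at most `limit` records instead of the full row group.
        static long readWithLimit(BatchReader reader, long limit) {
            long total = 0;
            while (total < limit) {
                int read = reader.readBatch(
                        (int) Math.min(BATCH_SIZE, limit - total));
                if (read == 0) break;         // end of row group
                total += read;
            }
            return total;
        }

        public static void main(String[] args) {
            // Stub row group with 32_768 records, as in the description.
            BatchReader rowGroup = new BatchReader() {
                long remaining = 32_768;
                public int readBatch(int max) {
                    int n = (int) Math.min(max, remaining);
                    remaining -= n;
                    return n;
                }
            };
            // Prints 100, not 32768.
            System.out.println(readWithLimit(rowGroup, 100));
        }
    }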


