[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15596788#comment-15596788 ]

ASF GitHub Bot commented on DRILL-4905:
---------------------------------------

Github user asfgit closed the pull request at:

    https://github.com/apache/drill/pull/597

> Push down the LIMIT to the parquet reader scan to limit the numbers of records read
> ------------------------------------------------------------------------------------
>
>                 Key: DRILL-4905
>                 URL: https://issues.apache.org/jira/browse/DRILL-4905
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Parquet
>    Affects Versions: 1.8.0
>            Reporter: Padma Penumarthy
>            Assignee: Padma Penumarthy
>             Fix For: 1.9.0
>
> Limit the number of records read from disk by pushing the limit down to the parquet reader.
> For queries like
>     select * from <table> limit N;
> where N is smaller than the Parquet row group size, we currently read 32K/64K rows or the entire row group. This needs to be optimized to read only N rows.
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15596393#comment-15596393 ]

ASF GitHub Bot commented on DRILL-4905:
---------------------------------------

Github user jinfengni commented on the issue:

    https://github.com/apache/drill/pull/597

    +1

    Thanks for the PR. It's good that you added a couple more unit tests.
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15586890#comment-15586890 ]

ASF GitHub Bot commented on DRILL-4905:
---------------------------------------

Github user ppadma commented on the issue:

    https://github.com/apache/drill/pull/597

    Updated diffs with review comments taken care of.
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15586883#comment-15586883 ]

ASF GitHub Bot commented on DRILL-4905:
---------------------------------------

Github user ppadma commented on a diff in the pull request:

    https://github.com/apache/drill/pull/597#discussion_r83965044

    --- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetGroupScan.java ---
    @@ -117,4 +119,18 @@ public void testSelectEmptyNoCache() throws Exception {
             uex.getMessage()), uex.getMessage().contains(expectedMsg));
           }
         }
    +
    +  @Test
    +  public void testLimit() throws Exception {
    +    List<QueryDataBatch> results = testSqlWithResults(String.format("select * from cp.`parquet/limitTest.parquet` limit 1"));
    --- End diff --

    done
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15586882#comment-15586882 ]

ASF GitHub Bot commented on DRILL-4905:
---------------------------------------

Github user ppadma commented on a diff in the pull request:

    https://github.com/apache/drill/pull/597#discussion_r83965017

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/ParquetRecordReader.java ---
    @@ -139,6 +155,11 @@ public ParquetRecordReader(
         this.batchSize = batchSize;
         this.footer = footer;
         this.fragmentContext = fragmentContext;
    +    if (numRecordsToRead == DEFAULT_RECORDS_TO_READ_NOT_SPECIFIED) {
    +      this.numRecordsToRead = footer.getBlocks().get(rowGroupIndex).getRowCount();
    +    } else {
    +      this.numRecordsToRead = numRecordsToRead;
    --- End diff --

    The current code already handles the case where numRecordsToRead is out of range, so I am not adding additional checks here.
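For context, the range handling being discussed amounts to a clamp like the following sketch. The helper name and the [1, rowCount] bounds are assumptions for illustration, not the committed patch:

{code}
// A minimal sketch of the range handling discussed above (names assumed).
static long resolveRecordsToRead(long requested, long rowCount, long notSpecifiedSentinel) {
  if (requested == notSpecifiedSentinel) {
    return rowCount;  // no LIMIT pushed down: read the whole row group
  }
  // Clamp an out-of-range request (e.g. 0 or rowCount + 100) into [1, rowCount].
  return Math.max(1L, Math.min(requested, rowCount));
}
{code}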
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573623#comment-15573623 ]

ASF GitHub Bot commented on DRILL-4905:
---------------------------------------

Github user jinfengni commented on a diff in the pull request:

    https://github.com/apache/drill/pull/597#discussion_r83335915

    --- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetGroupScan.java ---
    @@ -117,4 +119,18 @@ public void testSelectEmptyNoCache() throws Exception {
             uex.getMessage()), uex.getMessage().contains(expectedMsg));
           }
         }
    +
    +  @Test
    +  public void testLimit() throws Exception {
    +    List<QueryDataBatch> results = testSqlWithResults(String.format("select * from cp.`parquet/limitTest.parquet` limit 1"));
    --- End diff --

    Instead of using a new parquet file, you may consider using cp.`tpch/nation.parquet`, unless the new file has some properties that the existing files do not have. It's preferable to use an existing file, since it 1) reduces the Drill package size, and 2) people are more familiar with the TPCH sample files, making it easier to understand the expected result.
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573624#comment-15573624 ]

ASF GitHub Bot commented on DRILL-4905:
---------------------------------------

Github user jinfengni commented on a diff in the pull request:

    https://github.com/apache/drill/pull/597#discussion_r83335163

    --- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/ParquetRecordReaderTest.java ---
    @@ -637,7 +637,7 @@ public void testPerformance(@Injectable final DrillbitContext bitContext,
         final FileSystem fs = new CachedSingleFileSystem(fileName);
         final BufferAllocator allocator = RootAllocatorFactory.newRoot(c);
         for(int i = 0; i < 25; i++) {
    -      final ParquetRecordReader rr = new ParquetRecordReader(context, 256000, fileName, 0, fs,
    +      final ParquetRecordReader rr = new ParquetRecordReader(context, 256000, -1, fileName, 0, fs,
    --- End diff --

    If omitting the "numRecordsToRead" param means using the row count, we do not need to change anything here.
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573626#comment-15573626 ]

ASF GitHub Bot commented on DRILL-4905:
---------------------------------------

Github user jinfengni commented on a diff in the pull request:

    https://github.com/apache/drill/pull/597#discussion_r83335982

    --- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetGroupScan.java ---
    @@ -117,4 +119,18 @@ public void testSelectEmptyNoCache() throws Exception {
             uex.getMessage()), uex.getMessage().contains(expectedMsg));
           }
         }
    +
    +  @Test
    +  public void testLimit() throws Exception {
    +    List<QueryDataBatch> results = testSqlWithResults(String.format("select * from cp.`parquet/limitTest.parquet` limit 1"));
    --- End diff --

    Also, can you add a unit test with LIMIT n, where n > the number of rows in the table?
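A hedged sketch of the requested test, assuming it sits in a subclass of Drill's BaseTestQuery (so testSqlWithResults and QueryDataBatch are available) and relying on tpch/nation.parquet containing 25 rows; the test name is made up:

{code}
@Test
public void testLimitBeyondRowCount() throws Exception {
  // A LIMIT larger than the table must still return every row, not fail or over-read.
  List<QueryDataBatch> results =
      testSqlWithResults("select * from cp.`tpch/nation.parquet` limit 100");
  int rowCount = 0;
  for (QueryDataBatch batch : results) {
    rowCount += batch.getHeader().getRowCount();
    batch.release();  // release buffers to avoid allocator leaks in the test
  }
  assertEquals(25, rowCount);
}
{code}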
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573621#comment-15573621 ]

ASF GitHub Bot commented on DRILL-4905:
---------------------------------------

Github user jinfengni commented on a diff in the pull request:

    https://github.com/apache/drill/pull/597#discussion_r83334623

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/ParquetRecordReader.java ---
    @@ -139,6 +155,11 @@ public ParquetRecordReader(
         this.batchSize = batchSize;
         this.footer = footer;
         this.fragmentContext = fragmentContext;
    +    if (numRecordsToRead == DEFAULT_RECORDS_TO_READ_NOT_SPECIFIED) {
    +      this.numRecordsToRead = footer.getBlocks().get(rowGroupIndex).getRowCount();
    +    } else {
    +      this.numRecordsToRead = numRecordsToRead;
    --- End diff --

    I assume numRecordsToRead should be within the range [1, rowCount]. Do we need to enforce such a range? What if I pass in 0 or rowCount + 100?
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573622#comment-15573622 ]

ASF GitHub Bot commented on DRILL-4905:
---------------------------------------

Github user jinfengni commented on a diff in the pull request:

    https://github.com/apache/drill/pull/597#discussion_r83339590

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java ---
    @@ -913,19 +928,25 @@ public GroupScan applyLimit(long maxRecords) {
         long count = 0;
         int index = 0;
         for (RowGroupInfo rowGroupInfo : rowGroupInfos) {
    -      if (count < maxRecords) {
    -        count += rowGroupInfo.getRowCount();
    +      long rowCount = rowGroupInfo.getRowCount();
    --- End diff --

    The rowGroupInfos list is populated in the init() call when the ParquetGroupScan is created. Here, when DrillPushLimitIntoScanRule is fired for the first time, if we reduce the parquet files and reach line 959, we will re-populate the rowGroupInfos list.

    The reason your code works as expected is that DrillPushLimitIntoScanRule is fired twice. In the second rule execution, the file count is not reduced, but the rowGroupInfos list is updated in this for-loop block.

    However, I think it's not optimal to fire the rule twice. Ideally, we should avoid the second firing, since it supposedly does nothing (that's a separate issue). We do not want to add code that relies on the assumption that this rule will always be fired twice. Probably we should update rowGroupInfos after line 959, once the new group scan is created.
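The reviewer's suggestion might look roughly like the sketch below, placed after the clone at line 959. The direct field access and the trimming loop are assumptions for illustration, not the merged change:

{code}
// After cloning the scan with the reduced file selection, trim its row groups
// immediately instead of relying on a second firing of the rule.
ParquetGroupScan newScan = (ParquetGroupScan) this.clone(newSelection);
List<RowGroupInfo> kept = new ArrayList<>();
long count = 0;
for (RowGroupInfo rgi : newScan.rowGroupInfos) {
  if (count >= maxRecords) {
    break;  // enough rows are already covered; drop the remaining row groups
  }
  kept.add(rgi);
  count += rgi.getRowCount();
}
newScan.rowGroupInfos = kept;  // direct field access assumed for this sketch
return newScan;
{code}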
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573625#comment-15573625 ]

ASF GitHub Bot commented on DRILL-4905:
---------------------------------------

Github user jinfengni commented on a diff in the pull request:

    https://github.com/apache/drill/pull/597#discussion_r83334160

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/ParquetRecordReader.java ---
    @@ -115,17 +118,30 @@
       public ParquetRecordReader(FragmentContext fragmentContext,
           String path,
           int rowGroupIndex,
    +      long numRecordsToRead,
           FileSystem fs,
           CodecFactory codecFactory,
           ParquetMetadata footer,
           List<SchemaPath> columns) throws ExecutionSetupException {
    -    this(fragmentContext, DEFAULT_BATCH_LENGTH_IN_BITS, path, rowGroupIndex, fs, codecFactory, footer,
    -        columns);
    +    this(fragmentContext, DEFAULT_BATCH_LENGTH_IN_BITS, numRecordsToRead,
    +        path, rowGroupIndex, fs, codecFactory, footer, columns);
       }
    +
    +  public ParquetRecordReader(FragmentContext fragmentContext,
    +      String path,
    +      int rowGroupIndex,
    +      FileSystem fs,
    +      CodecFactory codecFactory,
    +      ParquetMetadata footer,
    +      List<SchemaPath> columns) throws ExecutionSetupException {
    +    this(fragmentContext, DEFAULT_BATCH_LENGTH_IN_BITS, DEFAULT_RECORDS_TO_READ_NOT_SPECIFIED,
    --- End diff --

    I think we can get rid of DEFAULT_RECORDS_TO_READ_NOT_SPECIFIED. If numRecordsToRead is not specified, just use the rowCount of the row group.
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15566419#comment-15566419 ]

ASF GitHub Bot commented on DRILL-4905:
---------------------------------------

Github user ppadma commented on the issue:

    https://github.com/apache/drill/pull/597

    Updated with new diffs. Please review.
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15534211#comment-15534211 ]

ASF GitHub Bot commented on DRILL-4905:
---------------------------------------

Github user jinfengni commented on a diff in the pull request:

    https://github.com/apache/drill/pull/597#discussion_r81245030

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java ---
    @@ -864,6 +872,14 @@ public String getDigest() {
         return toString();
       }

    +  public void setCacheFileRoot(String cacheFileRoot) {
    +    this.cacheFileRoot = cacheFileRoot;
    +  }
    +
    +  public void setBatchSize(long batchSize) {
    +    this.recommendedBatchSize = batchSize;
    --- End diff --

    I feel it's fine that we do not have such a check. The max value for the option is checked when a user runs the "alter session set" command. For a "limit n" query where n > the max value, ParquetRecordReader would pick the min of the two values, essentially ignoring this recommendedBatchSize (?).
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15534206#comment-15534206 ]

ASF GitHub Bot commented on DRILL-4905:
---------------------------------------

Github user gparai commented on a diff in the pull request:

    https://github.com/apache/drill/pull/597#discussion_r81244826

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java ---
    @@ -864,6 +872,14 @@ public String getDigest() {
         return toString();
       }

    +  public void setCacheFileRoot(String cacheFileRoot) {
    +    this.cacheFileRoot = cacheFileRoot;
    +  }
    +
    +  public void setBatchSize(long batchSize) {
    +    this.recommendedBatchSize = batchSize;
    --- End diff --

    Yes, sure.
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15534201#comment-15534201 ]

ASF GitHub Bot commented on DRILL-4905:
---------------------------------------

Github user ppadma commented on a diff in the pull request:

    https://github.com/apache/drill/pull/597#discussion_r81244611

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java ---
    @@ -864,6 +872,14 @@ public String getDigest() {
         return toString();
       }

    +  public void setCacheFileRoot(String cacheFileRoot) {
    +    this.cacheFileRoot = cacheFileRoot;
    +  }
    +
    +  public void setBatchSize(long batchSize) {
    +    this.recommendedBatchSize = batchSize;
    --- End diff --

    Checking against the current value is fine for now. Adding new methods would be a bit outside the scope of this JIRA.
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15534172#comment-15534172 ]

ASF GitHub Bot commented on DRILL-4905:
---------------------------------------

Github user gparai commented on a diff in the pull request:

    https://github.com/apache/drill/pull/597#discussion_r81242837

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java ---
    @@ -864,6 +872,14 @@ public String getDigest() {
         return toString();
       }

    +  public void setCacheFileRoot(String cacheFileRoot) {
    +    this.cacheFileRoot = cacheFileRoot;
    +  }
    +
    +  public void setBatchSize(long batchSize) {
    +    this.recommendedBatchSize = batchSize;
    --- End diff --

    This will check against the current value of the option rather than the max value? I thought we wanted to ensure we never exceed the MAX value of PARQUET_READER_RECORD_BATCH_SIZE - did I misunderstand? I looked into the code and it seems we don't have access to the max value - we would need to add new methods to expose it, OR maybe invoke the validate method?
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15534097#comment-15534097 ]

ASF GitHub Bot commented on DRILL-4905:
---------------------------------------

Github user ppadma commented on a diff in the pull request:

    https://github.com/apache/drill/pull/597#discussion_r81238087

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java ---
    @@ -864,6 +872,14 @@ public String getDigest() {
         return toString();
       }

    +  public void setCacheFileRoot(String cacheFileRoot) {
    +    this.cacheFileRoot = cacheFileRoot;
    +  }
    +
    +  public void setBatchSize(long batchSize) {
    +    this.recommendedBatchSize = batchSize;
    --- End diff --

    ok. Added the assert. Check updated diffs.
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15534031#comment-15534031 ]

ASF GitHub Bot commented on DRILL-4905:
---------------------------------------

Github user gparai commented on a diff in the pull request:

    https://github.com/apache/drill/pull/597#discussion_r81232963

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java ---
    @@ -864,6 +872,14 @@ public String getDigest() {
         return toString();
       }

    +  public void setCacheFileRoot(String cacheFileRoot) {
    +    this.cacheFileRoot = cacheFileRoot;
    +  }
    +
    +  public void setBatchSize(long batchSize) {
    +    this.recommendedBatchSize = batchSize;
    --- End diff --

    Would it be good to add an assert to check recommendedBatchSize <= PARQUET_READER_RECORD_BATCH_SIZE max_value?
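The suggested guard could look like the sketch below. MAX_RECOMMENDED_BATCH_SIZE stands in for the option's max value, which (as noted above) is not directly accessible, so the constant here is purely an assumption for illustration:

{code}
// Hedged sketch of the suggested assert, not the committed code.
private static final long MAX_RECOMMENDED_BATCH_SIZE = 256 * 1024;

public void setBatchSize(long batchSize) {
  assert batchSize <= MAX_RECOMMENDED_BATCH_SIZE :
      "recommendedBatchSize exceeds the record batch size option's max value";
  this.recommendedBatchSize = batchSize;
}
{code}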
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15533995#comment-15533995 ]

ASF GitHub Bot commented on DRILL-4905:
---------------------------------------

Github user jinfengni commented on the issue:

    https://github.com/apache/drill/pull/597

    +1

    LGTM.

    If possible, please run a performance workload. The patch supposedly should not impact performance, but in case I missed something, it would be good to run the performance regression suite to make sure.
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15533979#comment-15533979 ]

ASF GitHub Bot commented on DRILL-4905:
---------------------------------------

Github user ppadma commented on the issue:

    https://github.com/apache/drill/pull/597

    Updated with all review comments taken care of. Please review.
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15533511#comment-15533511 ]

ASF GitHub Bot commented on DRILL-4905:
---------------------------------------

Github user jinfengni commented on a diff in the pull request:

    https://github.com/apache/drill/pull/597#discussion_r81196419

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java ---
    @@ -894,11 +910,21 @@ public GroupScan clone(List<SchemaPath> columns) {
       public FileGroupScan clone(FileSelection selection) throws IOException {
         ParquetGroupScan newScan = new ParquetGroupScan(this);
         newScan.modifyFileSelection(selection);
    -    newScan.cacheFileRoot = selection.cacheFileRoot;
    +    newScan.setCacheFileRoot(selection.cacheFileRoot);
         newScan.init(selection.getMetaContext());
         return newScan;
       }

    +  // clone to create new groupscan with new file selection and batchSize.
    +  public ParquetGroupScan clone(FileSelection selection, int batchSize) throws IOException {
    +    ParquetGroupScan newScan = new ParquetGroupScan(this);
    --- End diff --

    Is it possible to share the code with the method in lines 910-914?
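One way to share that code, sketched under the assumption that the one-argument clone() stays as shown in the hunk above (this delegation is an illustration, not the merged change):

{code}
// Let the two-argument overload delegate to the existing clone(), then apply
// the batch size via the setter the patch already introduces.
public ParquetGroupScan clone(FileSelection selection, int batchSize) throws IOException {
  ParquetGroupScan newScan = (ParquetGroupScan) clone(selection);  // reuses lines 910-914
  newScan.setBatchSize(batchSize);
  return newScan;
}
{code}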
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15533501#comment-15533501 ]

ASF GitHub Bot commented on DRILL-4905:
---------------------------------------

Github user jinfengni commented on a diff in the pull request:

    https://github.com/apache/drill/pull/597#discussion_r81196052

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java ---
    @@ -926,16 +952,22 @@ public GroupScan applyLimit(long maxRecords) {
             fileNames.add(rowGroupInfo.getPath());
           }

    -    if (fileNames.size() == fileSet.size() ) {
    +    // If there is no change in fileSet and maxRecords is >= batchSize, no need to create new groupScan.
    +    if (fileNames.size() == fileSet.size() && (maxRecords >= recommendedBatchSize) ) {
           // There is no reduction of rowGroups. Return the original groupScan.
           logger.debug("applyLimit() does not apply!");
           return null;
         }

    +    // If limit maxRecords is less than batchSize, update batchSize to the limit size.
    +    if (maxRecords < recommendedBatchSize) {
    +      recommendedBatchSize = (int) maxRecords;
    +    }
    +
         try {
           FileSelection newSelection = new FileSelection(null, Lists.newArrayList(fileNames), getSelectionRoot(), cacheFileRoot, false);
           logger.debug("applyLimit() reduce parquet file # from {} to {}", fileSet.size(), fileNames.size());
    -      return this.clone(newSelection);
    +      return this.clone(newSelection, recommendedBatchSize);
    --- End diff --

    I feel that when the file selection is unchanged and maxRecords < recommendedBatchSize, we do not have to re-create a new ParquetGroupScan. In that case, all we need is to re-set the batch size. Recreating a ParquetGroupScan with the same file selection would incur the overhead of reading the parquet metadata again.
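The suggested short-circuit might be shaped like the sketch below; the control flow (in particular returning the mutated scan itself) is an assumption about the reviewer's intent, not the committed fix:

{code}
// When only the batch size changes, mutate it in place instead of cloning the
// group scan, which would re-read the parquet metadata.
if (fileNames.size() == fileSet.size()) {
  if (maxRecords < recommendedBatchSize) {
    recommendedBatchSize = (int) maxRecords;  // only the batch size changes
    return this;                              // reuse the existing group scan
  }
  logger.debug("applyLimit() does not apply!");
  return null;
}
{code}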
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15533479#comment-15533479 ]

ASF GitHub Bot commented on DRILL-4905:
---------------------------------------

Github user jinfengni commented on a diff in the pull request:

    https://github.com/apache/drill/pull/597#discussion_r81194437

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java ---
    @@ -926,16 +944,22 @@ public GroupScan applyLimit(long maxRecords) {
             fileNames.add(rowGroupInfo.getPath());
           }

    -    if (fileNames.size() == fileSet.size() ) {
    +    // If there is no change in fileSet and maxRecords is >= batchSize, no need to create new groupScan.
    +    if (fileNames.size() == fileSet.size() && (maxRecords >= batchSize) ) {
           // There is no reduction of rowGroups. Return the original groupScan.
           logger.debug("applyLimit() does not apply!");
           return null;
         }

    +    // If limit maxRecords is less than batchSize, update batchSize to the limit size.
    +    if (maxRecords < batchSize) {
    +      batchSize = (int) maxRecords;
    --- End diff --

    Maybe I'm missing something here, but it seems to me that as long as you declare "long batchSize" in ParquetGroupScan/ParquetRowGroupScan (all of this code appears to be new), you do not need this cast.
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15530521#comment-15530521 ]

ASF GitHub Bot commented on DRILL-4905:
---------------------------------------

Github user ppadma commented on a diff in the pull request:

    https://github.com/apache/drill/pull/597#discussion_r80990803

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java ---
    @@ -115,6 +115,8 @@
       private List<RowGroupInfo> rowGroupInfos;
       private Metadata.ParquetTableMetadataBase parquetTableMetadata = null;
       private String cacheFileRoot = null;
    +  private int batchSize;
    +  private static final int DEFAULT_BATCH_LENGTH = 256 * 1024;
    --- End diff --

    Done. Please review the new diffs.
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15530280#comment-15530280 ]

ASF GitHub Bot commented on DRILL-4905:
---------------------------------------

Github user jinfengni commented on a diff in the pull request:

    https://github.com/apache/drill/pull/597#discussion_r80970307

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java ---
    @@ -115,6 +115,8 @@
       private List<RowGroupInfo> rowGroupInfos;
       private Metadata.ParquetTableMetadataBase parquetTableMetadata = null;
       private String cacheFileRoot = null;
    +  private int batchSize;
    +  private static final int DEFAULT_BATCH_LENGTH = 256 * 1024;
    --- End diff --

    Now I guess I understand what you mean. You want to cap store.parquet.record_batch_size at 256K and set DEFAULT_BATCH_LENGTH = 256K in ParquetGroupScan. At execution time, you pick the min of batch_size and the value from the option, which will be no greater than the option value.

    If that's correct, can we remove DEFAULT_BATCH_LENGTH from ParquetGroupScan and instead use the batch_size specified in the new option you added for the non-LIMIT case?
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15530259#comment-15530259 ]

ASF GitHub Bot commented on DRILL-4905:
---------------------------------------

Github user jinfengni commented on a diff in the pull request:

    https://github.com/apache/drill/pull/597#discussion_r80968936

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/ExecConstants.java ---
    @@ -125,13 +125,14 @@
       String PARQUET_WRITER_ENABLE_DICTIONARY_ENCODING = "store.parquet.enable_dictionary_encoding";
       OptionValidator PARQUET_WRITER_ENABLE_DICTIONARY_ENCODING_VALIDATOR = new BooleanValidator(
           PARQUET_WRITER_ENABLE_DICTIONARY_ENCODING, false);
    -
       String PARQUET_VECTOR_FILL_THRESHOLD = "store.parquet.vector_fill_threshold";
       OptionValidator PARQUET_VECTOR_FILL_THRESHOLD_VALIDATOR = new PositiveLongValidator(PARQUET_VECTOR_FILL_THRESHOLD, 99l, 85l);
       String PARQUET_VECTOR_FILL_CHECK_THRESHOLD = "store.parquet.vector_fill_check_threshold";
       OptionValidator PARQUET_VECTOR_FILL_CHECK_THRESHOLD_VALIDATOR = new PositiveLongValidator(PARQUET_VECTOR_FILL_CHECK_THRESHOLD, 100l, 10l);
       String PARQUET_NEW_RECORD_READER = "store.parquet.use_new_reader";
       OptionValidator PARQUET_RECORD_READER_IMPLEMENTATION_VALIDATOR = new BooleanValidator(PARQUET_NEW_RECORD_READER, false);
    +  String PARQUET_RECORD_BATCH_SIZE = "store.parquet.record_batch_size";
    --- End diff --

    It would be nice to use "store.parquet.reader.record_batch_size", since this option is for the reader only, not for the parquet writer. We cannot change the option name once it's released.
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15530256#comment-15530256 ]

ASF GitHub Bot commented on DRILL-4905:
---------------------------------------

Github user jinfengni commented on a diff in the pull request:

    https://github.com/apache/drill/pull/597#discussion_r80968605

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/ExecConstants.java ---
    @@ -125,13 +125,14 @@
       String PARQUET_WRITER_ENABLE_DICTIONARY_ENCODING = "store.parquet.enable_dictionary_encoding";
       OptionValidator PARQUET_WRITER_ENABLE_DICTIONARY_ENCODING_VALIDATOR = new BooleanValidator(
           PARQUET_WRITER_ENABLE_DICTIONARY_ENCODING, false);
    -
       String PARQUET_VECTOR_FILL_THRESHOLD = "store.parquet.vector_fill_threshold";
       OptionValidator PARQUET_VECTOR_FILL_THRESHOLD_VALIDATOR = new PositiveLongValidator(PARQUET_VECTOR_FILL_THRESHOLD, 99l, 85l);
       String PARQUET_VECTOR_FILL_CHECK_THRESHOLD = "store.parquet.vector_fill_check_threshold";
       OptionValidator PARQUET_VECTOR_FILL_CHECK_THRESHOLD_VALIDATOR = new PositiveLongValidator(PARQUET_VECTOR_FILL_CHECK_THRESHOLD, 100l, 10l);
       String PARQUET_NEW_RECORD_READER = "store.parquet.use_new_reader";
       OptionValidator PARQUET_RECORD_READER_IMPLEMENTATION_VALIDATOR = new BooleanValidator(PARQUET_NEW_RECORD_READER, false);
    +  String PARQUET_RECORD_BATCH_SIZE = "store.parquet.record_batch_size";
    +  OptionValidator PARQUET_RECORD_BATCH_SIZE_VALIDATOR = new LongValidator(PARQUET_RECORD_BATCH_SIZE, 256*1024);
    --- End diff --

    If you want to specify the maximum allowed batch_size, you probably have to use PositiveLongValidator, which will allow you to specify a max.
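Following the PositiveLongValidator(name, max, default) pattern already visible in the hunk above, the suggested declaration could look like this sketch; the default value and the renamed option are assumptions here:

{code}
String PARQUET_RECORD_BATCH_SIZE = "store.parquet.reader.record_batch_size";
// max = 256K rows, default = 256K rows (default chosen for illustration only)
OptionValidator PARQUET_RECORD_BATCH_SIZE_VALIDATOR =
    new PositiveLongValidator(PARQUET_RECORD_BATCH_SIZE, 256 * 1024, 256 * 1024);
{code}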
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15527763#comment-15527763 ]

ASF GitHub Bot commented on DRILL-4905:
---------------------------------------

Github user ppadma commented on a diff in the pull request:

    https://github.com/apache/drill/pull/597#discussion_r80817027

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java ---
    @@ -115,6 +115,8 @@
       private List<RowGroupInfo> rowGroupInfos;
       private Metadata.ParquetTableMetadataBase parquetTableMetadata = null;
       private String cacheFileRoot = null;
    +  private int batchSize;
    +  private static final int DEFAULT_BATCH_LENGTH = 256 * 1024;
    --- End diff --

    The max value for store.parquet.record_batch_size is 256K, so it cannot be set to 512K. I changed the name in ParquetGroupScan/ParquetRowGroupScan to recommendedBatchSize, as we discussed. Please review the new diffs.
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15527425#comment-15527425 ]

ASF GitHub Bot commented on DRILL-4905:
---------------------------------------

Github user jinfengni commented on a diff in the pull request:

    https://github.com/apache/drill/pull/597#discussion_r80796011

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java ---
    @@ -115,6 +115,8 @@
       private List<RowGroupInfo> rowGroupInfos;
       private Metadata.ParquetTableMetadataBase parquetTableMetadata = null;
       private String cacheFileRoot = null;
    +  private int batchSize;
    +  private static final int DEFAULT_BATCH_LENGTH = 256 * 1024;
    --- End diff --

    Are you referring to the code here:

    {code}
    // Pick the minimum of recordsPerBatch calculated above, batchSize we got from rowGroupScan (based on limit)
    // and user configured batchSize value.
    recordsPerBatch = (int) Math.min(Math.min(recordsPerBatch, batchSize),
        fragmentContext.getOptions().getOption(ExecConstants.PARQUET_RECORD_BATCH_SIZE).num_val.intValue());
    {code}

    If I understand correctly, batchSize in ParquetRecordReader comes from ParquetRowGroupScan, which comes from ParquetGroupScan, which is set to DEFAULT_BATCH_LENGTH. If I have a row group with 512K rows and I set "store.parquet.record_batch_size" to 512K, will your code honor this 512K batch size, or will it use DEFAULT_BATCH_LENGTH, since it's the smallest?

    Also, if "store.parquet.record_batch_size" is set to something different from DEFAULT_BATCH_LENGTH, why would we still use DEFAULT_BATCH_LENGTH in ParquetGroupScan / ParquetRowGroupScan? People might be confused if they look at the serialized physical plan, which shows batchSize = DEFAULT_BATCH_LENGTH.
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15527347#comment-15527347 ]

ASF GitHub Bot commented on DRILL-4905:
---------------------------------------

Github user ppadma commented on a diff in the pull request:

    https://github.com/apache/drill/pull/597#discussion_r80788617

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java ---
    @@ -899,6 +907,16 @@ public FileGroupScan clone(FileSelection selection) throws IOException {
         return newScan;
       }

    +  // clone to create new groupscan with new file selection and batchSize.
    +  public ParquetGroupScan clone(FileSelection selection, int batchSize) throws IOException {
    +    ParquetGroupScan newScan = new ParquetGroupScan(this);
    +    newScan.modifyFileSelection(selection);
    +    newScan.cacheFileRoot = selection.cacheFileRoot;
    +    newScan.init(selection.getMetaContext());
    +    newScan.batchSize = batchSize;
    --- End diff --

    done
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15527346#comment-15527346 ]

ASF GitHub Bot commented on DRILL-4905:
---------------------------------------

Github user ppadma commented on a diff in the pull request:

    https://github.com/apache/drill/pull/597#discussion_r80788614

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java ---
    @@ -899,6 +907,16 @@ public FileGroupScan clone(FileSelection selection) throws IOException {
         return newScan;
       }

    +  // clone to create new groupscan with new file selection and batchSize.
    +  public ParquetGroupScan clone(FileSelection selection, int batchSize) throws IOException {
    +    ParquetGroupScan newScan = new ParquetGroupScan(this);
    +    newScan.modifyFileSelection(selection);
    +    newScan.cacheFileRoot = selection.cacheFileRoot;
    --- End diff --

    done
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15527343#comment-15527343 ]

ASF GitHub Bot commented on DRILL-4905:
---------------------------------------

Github user ppadma commented on a diff in the pull request:

    https://github.com/apache/drill/pull/597#discussion_r80788478

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetScanBatchCreator.java ---
    @@ -107,7 +107,7 @@ public ScanBatch getBatch(FragmentContext context, ParquetRowGroupScan rowGroupScan,
           if (!context.getOptions().getOption(ExecConstants.PARQUET_NEW_RECORD_READER).bool_val && !isComplex(footers.get(e.getPath()))) {
             readers.add(
                 new ParquetRecordReader(
    -                context, e.getPath(), e.getRowGroupIndex(), fs,
    +                context, rowGroupScan.getBatchSize(), e.getPath(), e.getRowGroupIndex(), fs,
    --- End diff --

    done.
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15527340#comment-15527340 ]

ASF GitHub Bot commented on DRILL-4905:
---------------------------------------

Github user ppadma commented on the issue:

    https://github.com/apache/drill/pull/597

    Updated with review comments.
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15527288#comment-15527288 ]

Padma Penumarthy commented on DRILL-4905:
-

Doing this fix for the native parquet reader only.
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15526958#comment-15526958 ]

ASF GitHub Bot commented on DRILL-4905:
---

Github user ppadma commented on a diff in the pull request:
https://github.com/apache/drill/pull/597#discussion_r80756694

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java ---
@@ -926,16 +944,22 @@ public GroupScan applyLimit(long maxRecords) {
       fileNames.add(rowGroupInfo.getPath());
     }

-    if (fileNames.size() == fileSet.size() ) {
+    // If there is no change in fileSet and maxRecords is >= batchSize, no need to create new groupScan.
+    if (fileNames.size() == fileSet.size() && (maxRecords >= batchSize) ) {
       // There is no reduction of rowGroups. Return the original groupScan.
       logger.debug("applyLimit() does not apply!");
       return null;
     }

+    // If limit maxRecords is less than batchSize, update batchSize to the limit size.
+    if (maxRecords < batchSize) {
+      batchSize = (int) maxRecords;
--- End diff --

I tried to avoid this cast. Fixing it properly would require changes in a bunch of places, and I was trying to keep the code changes minimal so that no new regressions are introduced.
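The decision under discussion, as a simplified sketch; ScanConfig and fileSetUnchanged are illustrative stand-ins for ParquetGroupScan's internals, not Drill's actual API:

    final class ScanConfig {
      static final int DEFAULT_BATCH_SIZE = 256 * 1024;
      private int batchSize = DEFAULT_BATCH_SIZE;

      // Returns null when the limit changes nothing (no file pruning and the
      // limit covers at least one full batch); otherwise returns a config
      // whose batch size is clamped down to the limit.
      ScanConfig applyLimit(long maxRecords, boolean fileSetUnchanged) {
        if (fileSetUnchanged && maxRecords >= batchSize) {
          return null; // keep the original scan
        }
        ScanConfig clamped = new ScanConfig();
        // The narrowing cast is safe: the min is bounded by batchSize, an int.
        clamped.batchSize = (int) Math.min(maxRecords, batchSize);
        return clamped;
      }
    }

Clamping via Math.min also sidesteps the raw (int) maxRecords cast questioned below, since the result can never exceed an int-valued field.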
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15526939#comment-15526939 ]

ASF GitHub Bot commented on DRILL-4905:
---

Github user jinfengni commented on the issue:
https://github.com/apache/drill/pull/597

comment
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15526935#comment-15526935 ]

ASF GitHub Bot commented on DRILL-4905:
---

Github user jinfengni commented on a diff in the pull request:
https://github.com/apache/drill/pull/597#discussion_r80754698

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetScanBatchCreator.java ---
@@ -107,7 +107,7 @@ public ScanBatch getBatch(FragmentContext context, ParquetRowGroupScan rowGroupS
     if (!context.getOptions().getOption(ExecConstants.PARQUET_NEW_RECORD_READER).bool_val && !isComplex(footers.get(e.getPath()))) {
       readers.add(
         new ParquetRecordReader(
-          context, e.getPath(), e.getRowGroupIndex(), fs,
+          context, rowGroupScan.getBatchSize(), e.getPath(), e.getRowGroupIndex(), fs,
--- End diff --

If it's only for one type of parquet reader, please document that in the JIRA so that people will know.
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15526932#comment-15526932 ]

ASF GitHub Bot commented on DRILL-4905:
---

Github user jinfengni commented on a diff in the pull request:
https://github.com/apache/drill/pull/597#discussion_r80754417

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java ---
@@ -115,6 +115,8 @@
   private List<RowGroupInfo> rowGroupInfos;
   private Metadata.ParquetTableMetadataBase parquetTableMetadata = null;
   private String cacheFileRoot = null;
+  private int batchSize;
+  private static final int DEFAULT_BATCH_LENGTH = 256 * 1024;
--- End diff --

If it has nothing to do with the new option, why do they have the same value of 256 * 1024? Just a coincidence? Also, since a new option is introduced: if a user alters this option and sets it to a different value than the default, will the option be honored in the parquet reader?
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15526924#comment-15526924 ]

ASF GitHub Bot commented on DRILL-4905:
---

Github user ppadma commented on a diff in the pull request:
https://github.com/apache/drill/pull/597#discussion_r80754060

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java ---
@@ -899,6 +907,16 @@ public FileGroupScan clone(FileSelection selection) throws IOException {
     return newScan;
   }

+  // clone to create new groupscan with new file selection and batchSize.
+  public ParquetGroupScan clone(FileSelection selection, int batchSize) throws IOException {
+    ParquetGroupScan newScan = new ParquetGroupScan(this);
+    newScan.modifyFileSelection(selection);
+    newScan.cacheFileRoot = selection.cacheFileRoot;
--- End diff --

yes, we can. will do.
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15526883#comment-15526883 ]

ASF GitHub Bot commented on DRILL-4905:
---

Github user ppadma commented on a diff in the pull request:
https://github.com/apache/drill/pull/597#discussion_r80751210

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetScanBatchCreator.java ---
@@ -107,7 +107,7 @@ public ScanBatch getBatch(FragmentContext context, ParquetRowGroupScan rowGroupS
     if (!context.getOptions().getOption(ExecConstants.PARQUET_NEW_RECORD_READER).bool_val && !isComplex(footers.get(e.getPath()))) {
       readers.add(
         new ParquetRecordReader(
-          context, e.getPath(), e.getRowGroupIndex(), fs,
+          context, rowGroupScan.getBatchSize(), e.getPath(), e.getRowGroupIndex(), fs,
--- End diff --

This fix is done only for the native parquet reader.
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15526881#comment-15526881 ]

ASF GitHub Bot commented on DRILL-4905:
---

Github user ppadma commented on a diff in the pull request:
https://github.com/apache/drill/pull/597#discussion_r80751189

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java ---
@@ -115,6 +115,8 @@
   private List<RowGroupInfo> rowGroupInfos;
   private Metadata.ParquetTableMetadataBase parquetTableMetadata = null;
   private String cacheFileRoot = null;
+  private int batchSize;
+  private static final int DEFAULT_BATCH_LENGTH = 256 * 1024;
--- End diff --

The default batch length is used here to compare against the limit value and decide whether we need to create a new group scan. The new option has nothing to do with this; in the normal case we do not touch or use the option. It is added so that we can change the batch size at run time if we ever need to.
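The pattern described here, a compile-time default with an optional runtime override, in a hedged sketch (the Map-based option lookup is illustrative only; Drill's OptionManager works differently):

    import java.util.Map;

    final class BatchSizeResolver {
      static final int DEFAULT_BATCH_LENGTH = 256 * 1024;

      // Falls back to the default unless the session set an override.
      static int resolve(Map<String, Long> sessionOptions) {
        Long override = sessionOptions.get("store.parquet.record_batch_size");
        return override == null ? DEFAULT_BATCH_LENGTH : Math.toIntExact(override);
      }
    }

One consistent reading of the thread is that the constant and the option express the same default batch length by two paths: one baked in for the planner's applyLimit() comparison, one adjustable per session.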
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15524922#comment-15524922 ]

ASF GitHub Bot commented on DRILL-4905:
---

Github user gparai commented on a diff in the pull request:
https://github.com/apache/drill/pull/597#discussion_r80612967

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java ---
@@ -899,6 +907,16 @@ public FileGroupScan clone(FileSelection selection) throws IOException {
     return newScan;
   }

+  // clone to create new groupscan with new file selection and batchSize.
+  public ParquetGroupScan clone(FileSelection selection, int batchSize) throws IOException {
+    ParquetGroupScan newScan = new ParquetGroupScan(this);
+    newScan.modifyFileSelection(selection);
+    newScan.cacheFileRoot = selection.cacheFileRoot;
+    newScan.init(selection.getMetaContext());
+    newScan.batchSize = batchSize;
--- End diff --

Should we have a setter method for this? Or a builder pattern?
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15524920#comment-15524920 ]

ASF GitHub Bot commented on DRILL-4905:
---

Github user gparai commented on a diff in the pull request:
https://github.com/apache/drill/pull/597#discussion_r80612914

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java ---
@@ -899,6 +907,16 @@ public FileGroupScan clone(FileSelection selection) throws IOException {
     return newScan;
   }

+  // clone to create new groupscan with new file selection and batchSize.
+  public ParquetGroupScan clone(FileSelection selection, int batchSize) throws IOException {
+    ParquetGroupScan newScan = new ParquetGroupScan(this);
+    newScan.modifyFileSelection(selection);
+    newScan.cacheFileRoot = selection.cacheFileRoot;
--- End diff --

Should we have a setter method for this?
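One shape the builder suggestion could take; a sketch with hypothetical names, not the PR's actual code:

    // Hypothetical immutable scan settings produced by a builder.
    final class ScanSettings {
      final String cacheFileRoot;
      final int batchSize;

      private ScanSettings(Builder b) {
        this.cacheFileRoot = b.cacheFileRoot;
        this.batchSize = b.batchSize;
      }

      static final class Builder {
        private String cacheFileRoot;
        private int batchSize = 256 * 1024;

        Builder cacheFileRoot(String root) { this.cacheFileRoot = root; return this; }
        Builder batchSize(int n) { this.batchSize = n; return this; }
        ScanSettings build() { return new ScanSettings(this); }
      }
    }

Usage would read new ScanSettings.Builder().cacheFileRoot(selection.cacheFileRoot).batchSize(batchSize).build(), keeping direct field assignments like those in clone() behind a fluent API so new clone variants don't multiply overloads.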
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15524921#comment-15524921 ]

ASF GitHub Bot commented on DRILL-4905:
---

Github user gparai commented on a diff in the pull request:
https://github.com/apache/drill/pull/597#discussion_r80613372

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java ---
@@ -926,16 +944,22 @@ public GroupScan applyLimit(long maxRecords) {
       fileNames.add(rowGroupInfo.getPath());
     }

-    if (fileNames.size() == fileSet.size() ) {
+    // If there is no change in fileSet and maxRecords is >= batchSize, no need to create new groupScan.
+    if (fileNames.size() == fileSet.size() && (maxRecords >= batchSize) ) {
       // There is no reduction of rowGroups. Return the original groupScan.
       logger.debug("applyLimit() does not apply!");
       return null;
     }

+    // If limit maxRecords is less than batchSize, update batchSize to the limit size.
+    if (maxRecords < batchSize) {
+      batchSize = (int) maxRecords;
--- End diff --

Can we avoid the cast?
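Two standard ways to answer "can we avoid the cast?" in Java, as a sketch:

    final class Casts {
      // Clamp: the min is bounded by batchSize (an int), so narrowing cannot overflow.
      static int clampToBatchSize(long maxRecords, int batchSize) {
        return (int) Math.min(maxRecords, batchSize);
      }

      // Strict: throws ArithmeticException on overflow instead of truncating silently.
      static int strictNarrow(long maxRecords) {
        return Math.toIntExact(maxRecords);
      }
    }

In this particular branch the guard maxRecords < batchSize already bounds the value, so the existing cast is safe; the alternatives above just make that invariant explicit in code.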
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15524755#comment-15524755 ]

ASF GitHub Bot commented on DRILL-4905:
---

Github user jinfengni commented on a diff in the pull request:
https://github.com/apache/drill/pull/597#discussion_r80608344

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetScanBatchCreator.java ---
@@ -107,7 +107,7 @@ public ScanBatch getBatch(FragmentContext context, ParquetRowGroupScan rowGroupS
     if (!context.getOptions().getOption(ExecConstants.PARQUET_NEW_RECORD_READER).bool_val && !isComplex(footers.get(e.getPath()))) {
       readers.add(
         new ParquetRecordReader(
-          context, e.getPath(), e.getRowGroupIndex(), fs,
+          context, rowGroupScan.getBatchSize(), e.getPath(), e.getRowGroupIndex(), fs,
--- End diff --

What about DrillParquetReader?
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15524754#comment-15524754 ]

ASF GitHub Bot commented on DRILL-4905:
---

Github user jinfengni commented on a diff in the pull request:
https://github.com/apache/drill/pull/597#discussion_r80606955

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java ---
@@ -115,6 +115,8 @@
   private List<RowGroupInfo> rowGroupInfos;
   private Metadata.ParquetTableMetadataBase parquetTableMetadata = null;
   private String cacheFileRoot = null;
+  private int batchSize;
+  private static final int DEFAULT_BATCH_LENGTH = 256 * 1024;
--- End diff --

Why do we have this DEFAULT_BATCH_LENGTH in addition to the new option "store.parquet.record_batch_size"?
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15524649#comment-15524649 ]

ASF GitHub Bot commented on DRILL-4905:
---

GitHub user ppadma opened a pull request:
https://github.com/apache/drill/pull/597

DRILL-4905: Push down the LIMIT to the parquet reader scan.

For a limit N query, where N is less than the current default record batch size (256K when all columns are fixed-length, 32K otherwise), we still end up reading all 256K/32K rows from disk if the row group has that many rows. This causes performance degradation, especially when there is a large number of columns. This fix addresses the problem by changing the record batch size the parquet record reader uses, so that we don't read more than what is needed. It also adds a sys option (store.parquet.record_batch_size) so the record batch size can be set at run time.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ppadma/drill DRILL-4905

Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/drill/pull/597.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #597

commit cd665ebdba11f8685ba446f5ec535c81ddd6edc7
Author: Padma Penumarthy
Date: 2016-09-26T17:51:07Z

    DRILL-4905: Push down the LIMIT to the parquet reader scan to limit the numbers of records read
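The net effect of the change, as a tiny model (names are illustrative, not Drill's API):

    final class LimitPushdown {
      static final int DEFAULT_BATCH_SIZE = 256 * 1024;

      // Rows the reader materializes for its first batch under a LIMIT.
      static long firstBatchRows(long limit, long rowsInRowGroup) {
        long batch = Math.min(DEFAULT_BATCH_SIZE, limit); // pushdown clamps the batch
        return Math.min(batch, rowsInRowGroup);           // never read past the row group
      }
    }

For example, firstBatchRows(1, 1_000_000) is 1 after the fix, where the old behavior read a full 256K-row (or 32K-row) batch from disk before the LIMIT operator discarded the excess.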