[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15596788#comment-15596788 ]

ASF GitHub Bot commented on DRILL-4905:
---------------------------------------

Github user asfgit closed the pull request at:

    https://github.com/apache/drill/pull/597

> Push down the LIMIT to the parquet reader scan to limit the numbers of records read
> ------------------------------------------------------------------------------------
>
>                 Key: DRILL-4905
>                 URL: https://issues.apache.org/jira/browse/DRILL-4905
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Parquet
>    Affects Versions: 1.8.0
>            Reporter: Padma Penumarthy
>            Assignee: Padma Penumarthy
>             Fix For: 1.9.0
>
> Limit the number of records read from disk by pushing the limit down to the parquet reader.
> For queries like
>     select * from <table> limit N;
> where N is smaller than the Parquet row group size, we currently read 32K/64K rows or the entire row group. This needs to be optimized to read only N rows.
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15596393#comment-15596393 ]

ASF GitHub Bot commented on DRILL-4905:
---------------------------------------

Github user jinfengni commented on the issue:

    https://github.com/apache/drill/pull/597

    +1

    Thanks for the PR. It's good that you added a couple more unit tests.
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15586890#comment-15586890 ]

ASF GitHub Bot commented on DRILL-4905:
---------------------------------------

Github user ppadma commented on the issue:

    https://github.com/apache/drill/pull/597

    Updated diffs with review comments taken care of.
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15586883#comment-15586883 ]

ASF GitHub Bot commented on DRILL-4905:
---------------------------------------

Github user ppadma commented on a diff in the pull request:

    https://github.com/apache/drill/pull/597#discussion_r83965044

    --- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetGroupScan.java ---
    @@ -117,4 +119,18 @@ public void testSelectEmptyNoCache() throws Exception {
             uex.getMessage()), uex.getMessage().contains(expectedMsg));
           }
         }
    +
    +  @Test
    +  public void testLimit() throws Exception {
    +    List<QueryDataBatch> results = testSqlWithResults(String.format("select * from cp.`parquet/limitTest.parquet` limit 1"));
    --- End diff --

    done
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15586882#comment-15586882 ]

ASF GitHub Bot commented on DRILL-4905:
---------------------------------------

Github user ppadma commented on a diff in the pull request:

    https://github.com/apache/drill/pull/597#discussion_r83965017

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/ParquetRecordReader.java ---
    @@ -139,6 +155,11 @@ public ParquetRecordReader(
         this.batchSize = batchSize;
         this.footer = footer;
         this.fragmentContext = fragmentContext;
    +    if (numRecordsToRead == DEFAULT_RECORDS_TO_READ_NOT_SPECIFIED) {
    +      this.numRecordsToRead = footer.getBlocks().get(rowGroupIndex).getRowCount();
    +    } else {
    +      this.numRecordsToRead = numRecordsToRead;
    --- End diff --

    The current code already handles the case where numRecordsToRead is out of range, so I am not adding additional checks here.
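For context, the range handling being discussed amounts to a clamp like the following sketch. The helper name and the [1, rowCount] bounds are assumptions for illustration, not the committed patch:

{code}
// A minimal sketch of the range handling discussed above (names assumed).
static long resolveRecordsToRead(long requested, long rowCount, long notSpecifiedSentinel) {
  if (requested == notSpecifiedSentinel) {
    return rowCount;  // no LIMIT pushed down: read the whole row group
  }
  // Clamp an out-of-range request (e.g. 0 or rowCount + 100) into [1, rowCount].
  return Math.max(1L, Math.min(requested, rowCount));
}
{code}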
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573623#comment-15573623 ]

ASF GitHub Bot commented on DRILL-4905:
---------------------------------------

Github user jinfengni commented on a diff in the pull request:

    https://github.com/apache/drill/pull/597#discussion_r83335915

    --- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetGroupScan.java ---
    @@ -117,4 +119,18 @@ public void testSelectEmptyNoCache() throws Exception {
             uex.getMessage()), uex.getMessage().contains(expectedMsg));
           }
         }
    +
    +  @Test
    +  public void testLimit() throws Exception {
    +    List<QueryDataBatch> results = testSqlWithResults(String.format("select * from cp.`parquet/limitTest.parquet` limit 1"));
    --- End diff --

    Instead of using a new parquet file, you may consider using cp.`tpch/nation.parquet`, unless the new file has some properties that the existing files do not have. It's preferable to use an existing file, since it 1) reduces the Drill package size, and 2) people are more familiar with the TPCH sample files, making it easier to understand the expected result.
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573624#comment-15573624 ]

ASF GitHub Bot commented on DRILL-4905:
---------------------------------------

Github user jinfengni commented on a diff in the pull request:

    https://github.com/apache/drill/pull/597#discussion_r83335163

    --- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/ParquetRecordReaderTest.java ---
    @@ -637,7 +637,7 @@ public void testPerformance(@Injectable final DrillbitContext bitContext,
         final FileSystem fs = new CachedSingleFileSystem(fileName);
         final BufferAllocator allocator = RootAllocatorFactory.newRoot(c);
         for(int i = 0; i < 25; i++) {
    -      final ParquetRecordReader rr = new ParquetRecordReader(context, 256000, fileName, 0, fs,
    +      final ParquetRecordReader rr = new ParquetRecordReader(context, 256000, -1, fileName, 0, fs,
    --- End diff --

    If omitting the "numRecordsToRead" param means using the row count, we do not need to change anything here.
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573626#comment-15573626 ]

ASF GitHub Bot commented on DRILL-4905:
---------------------------------------

Github user jinfengni commented on a diff in the pull request:

    https://github.com/apache/drill/pull/597#discussion_r83335982

    --- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetGroupScan.java ---
    @@ -117,4 +119,18 @@ public void testSelectEmptyNoCache() throws Exception {
             uex.getMessage()), uex.getMessage().contains(expectedMsg));
           }
         }
    +
    +  @Test
    +  public void testLimit() throws Exception {
    +    List<QueryDataBatch> results = testSqlWithResults(String.format("select * from cp.`parquet/limitTest.parquet` limit 1"));
    --- End diff --

    Also, can you add a unit test with LIMIT n, where n > the number of rows in the table?
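A hedged sketch of the requested test, assuming it sits in a subclass of Drill's BaseTestQuery (so testSqlWithResults and QueryDataBatch are available) and relying on tpch/nation.parquet containing 25 rows; the test name is made up:

{code}
@Test
public void testLimitBeyondRowCount() throws Exception {
  // A LIMIT larger than the table must still return every row, not fail or over-read.
  List<QueryDataBatch> results =
      testSqlWithResults("select * from cp.`tpch/nation.parquet` limit 100");
  int rowCount = 0;
  for (QueryDataBatch batch : results) {
    rowCount += batch.getHeader().getRowCount();
    batch.release();  // release buffers to avoid allocator leaks in the test
  }
  assertEquals(25, rowCount);
}
{code}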
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573621#comment-15573621 ]

ASF GitHub Bot commented on DRILL-4905:
---------------------------------------

Github user jinfengni commented on a diff in the pull request:

    https://github.com/apache/drill/pull/597#discussion_r83334623

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/ParquetRecordReader.java ---
    @@ -139,6 +155,11 @@ public ParquetRecordReader(
         this.batchSize = batchSize;
         this.footer = footer;
         this.fragmentContext = fragmentContext;
    +    if (numRecordsToRead == DEFAULT_RECORDS_TO_READ_NOT_SPECIFIED) {
    +      this.numRecordsToRead = footer.getBlocks().get(rowGroupIndex).getRowCount();
    +    } else {
    +      this.numRecordsToRead = numRecordsToRead;
    --- End diff --

    I assume numRecordsToRead should be within the range [1, rowCount]. Do we need to enforce such a range? What if I pass in 0 or rowCount + 100?
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573622#comment-15573622 ]

ASF GitHub Bot commented on DRILL-4905:
---------------------------------------

Github user jinfengni commented on a diff in the pull request:

    https://github.com/apache/drill/pull/597#discussion_r83339590

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java ---
    @@ -913,19 +928,25 @@ public GroupScan applyLimit(long maxRecords) {
         long count = 0;
         int index = 0;
         for (RowGroupInfo rowGroupInfo : rowGroupInfos) {
    -      if (count < maxRecords) {
    -        count += rowGroupInfo.getRowCount();
    +      long rowCount = rowGroupInfo.getRowCount();
    --- End diff --

    The rowGroupInfos list is populated in the init() call when the ParquetGroupScan is created. Here, when DrillPushLimitIntoScanRule is fired for the first time, if we reduce the parquet files and reach line 959, we will re-populate the rowGroupInfos list.

    The reason your code works as expected is that DrillPushLimitIntoScanRule is fired twice. In the second rule execution, the file count is not reduced, but the rowGroupInfos list is updated in this for-loop block.

    However, I think it's not optimal to fire the rule twice. Ideally, we should avoid the second firing, since it supposedly does nothing (that's a separate issue). We do not want to add code that relies on the assumption that this rule will always be fired twice. Probably we should update rowGroupInfos after line 959, once the new group scan is created.
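The reviewer's suggestion might look roughly like the sketch below, placed after the clone at line 959. The direct field access and the trimming loop are assumptions for illustration, not the merged change:

{code}
// After cloning the scan with the reduced file selection, trim its row groups
// immediately instead of relying on a second firing of the rule.
ParquetGroupScan newScan = (ParquetGroupScan) this.clone(newSelection);
List<RowGroupInfo> kept = new ArrayList<>();
long count = 0;
for (RowGroupInfo rgi : newScan.rowGroupInfos) {
  if (count >= maxRecords) {
    break;  // enough rows are already covered; drop the remaining row groups
  }
  kept.add(rgi);
  count += rgi.getRowCount();
}
newScan.rowGroupInfos = kept;  // direct field access assumed for this sketch
return newScan;
{code}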
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573625#comment-15573625 ]

ASF GitHub Bot commented on DRILL-4905:
---------------------------------------

Github user jinfengni commented on a diff in the pull request:

    https://github.com/apache/drill/pull/597#discussion_r83334160

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/ParquetRecordReader.java ---
    @@ -115,17 +118,30 @@
       public ParquetRecordReader(FragmentContext fragmentContext,
           String path,
           int rowGroupIndex,
    +      long numRecordsToRead,
           FileSystem fs,
           CodecFactory codecFactory,
           ParquetMetadata footer,
           List<SchemaPath> columns) throws ExecutionSetupException {
    -    this(fragmentContext, DEFAULT_BATCH_LENGTH_IN_BITS, path, rowGroupIndex, fs, codecFactory, footer,
    -        columns);
    +    this(fragmentContext, DEFAULT_BATCH_LENGTH_IN_BITS, numRecordsToRead,
    +        path, rowGroupIndex, fs, codecFactory, footer, columns);
       }
    +
    +  public ParquetRecordReader(FragmentContext fragmentContext,
    +      String path,
    +      int rowGroupIndex,
    +      FileSystem fs,
    +      CodecFactory codecFactory,
    +      ParquetMetadata footer,
    +      List<SchemaPath> columns) throws ExecutionSetupException {
    +    this(fragmentContext, DEFAULT_BATCH_LENGTH_IN_BITS, DEFAULT_RECORDS_TO_READ_NOT_SPECIFIED,
    --- End diff --

    I think we can get rid of DEFAULT_RECORDS_TO_READ_NOT_SPECIFIED. If numRecordsToRead is not specified, just use the rowCount of the row group.
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15566419#comment-15566419 ]

ASF GitHub Bot commented on DRILL-4905:
---------------------------------------

Github user ppadma commented on the issue:

    https://github.com/apache/drill/pull/597

    Updated with new diffs. Please review.
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15534211#comment-15534211 ]

ASF GitHub Bot commented on DRILL-4905:
---------------------------------------

Github user jinfengni commented on a diff in the pull request:

    https://github.com/apache/drill/pull/597#discussion_r81245030

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java ---
    @@ -864,6 +872,14 @@ public String getDigest() {
         return toString();
       }

    +  public void setCacheFileRoot(String cacheFileRoot) {
    +    this.cacheFileRoot = cacheFileRoot;
    +  }
    +
    +  public void setBatchSize(long batchSize) {
    +    this.recommendedBatchSize = batchSize;
    --- End diff --

    I feel it's fine that we do not have such a check. The max value for the option is checked when a user runs the "alter session set" command. For a "limit n" query where n > the max value, ParquetRecordReader would pick the min of the two values, essentially ignoring this recommendedBatchSize (?).
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15534206#comment-15534206 ]

ASF GitHub Bot commented on DRILL-4905:
---------------------------------------

Github user gparai commented on a diff in the pull request:

    https://github.com/apache/drill/pull/597#discussion_r81244826

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java ---
    @@ -864,6 +872,14 @@ public String getDigest() {
         return toString();
       }

    +  public void setCacheFileRoot(String cacheFileRoot) {
    +    this.cacheFileRoot = cacheFileRoot;
    +  }
    +
    +  public void setBatchSize(long batchSize) {
    +    this.recommendedBatchSize = batchSize;
    --- End diff --

    Yes, sure.
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15534201#comment-15534201 ]

ASF GitHub Bot commented on DRILL-4905:
---------------------------------------

Github user ppadma commented on a diff in the pull request:

    https://github.com/apache/drill/pull/597#discussion_r81244611

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java ---
    @@ -864,6 +872,14 @@ public String getDigest() {
         return toString();
       }

    +  public void setCacheFileRoot(String cacheFileRoot) {
    +    this.cacheFileRoot = cacheFileRoot;
    +  }
    +
    +  public void setBatchSize(long batchSize) {
    +    this.recommendedBatchSize = batchSize;
    --- End diff --

    Checking against the current value is fine for now. Adding new methods would be a bit outside the scope of this JIRA.
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15534172#comment-15534172 ]

ASF GitHub Bot commented on DRILL-4905:
---------------------------------------

Github user gparai commented on a diff in the pull request:

    https://github.com/apache/drill/pull/597#discussion_r81242837

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java ---
    @@ -864,6 +872,14 @@ public String getDigest() {
         return toString();
       }

    +  public void setCacheFileRoot(String cacheFileRoot) {
    +    this.cacheFileRoot = cacheFileRoot;
    +  }
    +
    +  public void setBatchSize(long batchSize) {
    +    this.recommendedBatchSize = batchSize;
    --- End diff --

    This will check against the current value of the option rather than the max value? I thought we wanted to ensure we never exceed the MAX value of PARQUET_READER_RECORD_BATCH_SIZE - did I misunderstand? I looked into the code and it seems we don't have access to the max value - we would need to add new methods to expose it, OR maybe invoke the validate method?
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15534097#comment-15534097 ]

ASF GitHub Bot commented on DRILL-4905:
---------------------------------------

Github user ppadma commented on a diff in the pull request:

    https://github.com/apache/drill/pull/597#discussion_r81238087

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java ---
    @@ -864,6 +872,14 @@ public String getDigest() {
         return toString();
       }

    +  public void setCacheFileRoot(String cacheFileRoot) {
    +    this.cacheFileRoot = cacheFileRoot;
    +  }
    +
    +  public void setBatchSize(long batchSize) {
    +    this.recommendedBatchSize = batchSize;
    --- End diff --

    ok. Added the assert. Check updated diffs.
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15534031#comment-15534031 ]

ASF GitHub Bot commented on DRILL-4905:
---------------------------------------

Github user gparai commented on a diff in the pull request:

    https://github.com/apache/drill/pull/597#discussion_r81232963

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java ---
    @@ -864,6 +872,14 @@ public String getDigest() {
         return toString();
       }

    +  public void setCacheFileRoot(String cacheFileRoot) {
    +    this.cacheFileRoot = cacheFileRoot;
    +  }
    +
    +  public void setBatchSize(long batchSize) {
    +    this.recommendedBatchSize = batchSize;
    --- End diff --

    Would it be good to add an assert to check recommendedBatchSize <= PARQUET_READER_RECORD_BATCH_SIZE max_value?
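The suggested guard could look like the sketch below. MAX_RECOMMENDED_BATCH_SIZE stands in for the option's max value, which (as noted above) is not directly accessible, so the constant here is purely an assumption for illustration:

{code}
// Hedged sketch of the suggested assert, not the committed code.
private static final long MAX_RECOMMENDED_BATCH_SIZE = 256 * 1024;

public void setBatchSize(long batchSize) {
  assert batchSize <= MAX_RECOMMENDED_BATCH_SIZE :
      "recommendedBatchSize exceeds the record batch size option's max value";
  this.recommendedBatchSize = batchSize;
}
{code}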
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15533995#comment-15533995 ]

ASF GitHub Bot commented on DRILL-4905:
---------------------------------------

Github user jinfengni commented on the issue:

    https://github.com/apache/drill/pull/597

    +1

    LGTM.

    If possible, please run a performance workload. The patch supposedly should not impact performance, but in case I missed something, it would be good to run the performance regression suite to make sure.
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15533979#comment-15533979 ]

ASF GitHub Bot commented on DRILL-4905:
---------------------------------------

Github user ppadma commented on the issue:

    https://github.com/apache/drill/pull/597

    Updated with all review comments taken care of. Please review.
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15533511#comment-15533511 ]

ASF GitHub Bot commented on DRILL-4905:
---------------------------------------

Github user jinfengni commented on a diff in the pull request:

    https://github.com/apache/drill/pull/597#discussion_r81196419

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java ---
    @@ -894,11 +910,21 @@ public GroupScan clone(List<SchemaPath> columns) {
       public FileGroupScan clone(FileSelection selection) throws IOException {
         ParquetGroupScan newScan = new ParquetGroupScan(this);
         newScan.modifyFileSelection(selection);
    -    newScan.cacheFileRoot = selection.cacheFileRoot;
    +    newScan.setCacheFileRoot(selection.cacheFileRoot);
         newScan.init(selection.getMetaContext());
         return newScan;
       }

    +  // clone to create new groupscan with new file selection and batchSize.
    +  public ParquetGroupScan clone(FileSelection selection, int batchSize) throws IOException {
    +    ParquetGroupScan newScan = new ParquetGroupScan(this);
    --- End diff --

    Is it possible to share the code with the method in lines 910-914?
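One way to share that code, sketched under the assumption that the one-argument clone() stays as shown in the hunk above (this delegation is an illustration, not the merged change):

{code}
// Let the two-argument overload delegate to the existing clone(), then apply
// the batch size via the setter the patch already introduces.
public ParquetGroupScan clone(FileSelection selection, int batchSize) throws IOException {
  ParquetGroupScan newScan = (ParquetGroupScan) clone(selection);  // reuses lines 910-914
  newScan.setBatchSize(batchSize);
  return newScan;
}
{code}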
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15533501#comment-15533501 ]

ASF GitHub Bot commented on DRILL-4905:
---------------------------------------

Github user jinfengni commented on a diff in the pull request:

    https://github.com/apache/drill/pull/597#discussion_r81196052

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java ---
    @@ -926,16 +952,22 @@ public GroupScan applyLimit(long maxRecords) {
             fileNames.add(rowGroupInfo.getPath());
           }

    -    if (fileNames.size() == fileSet.size() ) {
    +    // If there is no change in fileSet and maxRecords is >= batchSize, no need to create new groupScan.
    +    if (fileNames.size() == fileSet.size() && (maxRecords >= recommendedBatchSize) ) {
           // There is no reduction of rowGroups. Return the original groupScan.
           logger.debug("applyLimit() does not apply!");
           return null;
         }

    +    // If limit maxRecords is less than batchSize, update batchSize to the limit size.
    +    if (maxRecords < recommendedBatchSize) {
    +      recommendedBatchSize = (int) maxRecords;
    +    }
    +
         try {
           FileSelection newSelection = new FileSelection(null, Lists.newArrayList(fileNames), getSelectionRoot(), cacheFileRoot, false);
           logger.debug("applyLimit() reduce parquet file # from {} to {}", fileSet.size(), fileNames.size());
    -      return this.clone(newSelection);
    +      return this.clone(newSelection, recommendedBatchSize);
    --- End diff --

    I feel that when the file selection is unchanged and maxRecords < recommendedBatchSize, we do not have to re-create a new ParquetGroupScan. In that case, all we need is to re-set the batch size. Recreating a ParquetGroupScan with the same file selection would incur the overhead of reading the parquet metadata again.
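The suggested short-circuit might be shaped like the sketch below; the control flow (in particular returning the mutated scan itself) is an assumption about the reviewer's intent, not the committed fix:

{code}
// When only the batch size changes, mutate it in place instead of cloning the
// group scan, which would re-read the parquet metadata.
if (fileNames.size() == fileSet.size()) {
  if (maxRecords < recommendedBatchSize) {
    recommendedBatchSize = (int) maxRecords;  // only the batch size changes
    return this;                              // reuse the existing group scan
  }
  logger.debug("applyLimit() does not apply!");
  return null;
}
{code}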
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15533479#comment-15533479 ]

ASF GitHub Bot commented on DRILL-4905:
---------------------------------------

Github user jinfengni commented on a diff in the pull request:

    https://github.com/apache/drill/pull/597#discussion_r81194437

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java ---
    @@ -926,16 +944,22 @@ public GroupScan applyLimit(long maxRecords) {
             fileNames.add(rowGroupInfo.getPath());
           }

    -    if (fileNames.size() == fileSet.size() ) {
    +    // If there is no change in fileSet and maxRecords is >= batchSize, no need to create new groupScan.
    +    if (fileNames.size() == fileSet.size() && (maxRecords >= batchSize) ) {
           // There is no reduction of rowGroups. Return the original groupScan.
           logger.debug("applyLimit() does not apply!");
           return null;
         }

    +    // If limit maxRecords is less than batchSize, update batchSize to the limit size.
    +    if (maxRecords < batchSize) {
    +      batchSize = (int) maxRecords;
    --- End diff --

    Maybe I'm missing something here, but it seems to me that as long as you declare "long batchSize" in ParquetGroupScan/ParquetRowGroupScan (all of this code appears to be new), you do not need this cast.
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15530521#comment-15530521 ]

ASF GitHub Bot commented on DRILL-4905:
---------------------------------------

Github user ppadma commented on a diff in the pull request:

    https://github.com/apache/drill/pull/597#discussion_r80990803

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java ---
    @@ -115,6 +115,8 @@
       private List<RowGroupInfo> rowGroupInfos;
       private Metadata.ParquetTableMetadataBase parquetTableMetadata = null;
       private String cacheFileRoot = null;
    +  private int batchSize;
    +  private static final int DEFAULT_BATCH_LENGTH = 256 * 1024;
    --- End diff --

    Done. Please review the new diffs.
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15530280#comment-15530280 ]

ASF GitHub Bot commented on DRILL-4905:
---------------------------------------

Github user jinfengni commented on a diff in the pull request:

    https://github.com/apache/drill/pull/597#discussion_r80970307

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java ---
    @@ -115,6 +115,8 @@
       private List<RowGroupInfo> rowGroupInfos;
       private Metadata.ParquetTableMetadataBase parquetTableMetadata = null;
       private String cacheFileRoot = null;
    +  private int batchSize;
    +  private static final int DEFAULT_BATCH_LENGTH = 256 * 1024;
    --- End diff --

    Now I guess I understand what you mean. You want to cap store.parquet.record_batch_size at 256K and set DEFAULT_BATCH_LENGTH = 256K in ParquetGroupScan. At execution time, you pick the min of batch_size and the value from the option, which will be no greater than the option value.

    If that's correct, can we remove DEFAULT_BATCH_LENGTH from ParquetGroupScan and instead use the batch_size specified in the new option you added for the non-LIMIT case?
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15530259#comment-15530259 ]

ASF GitHub Bot commented on DRILL-4905:
---------------------------------------

Github user jinfengni commented on a diff in the pull request:

    https://github.com/apache/drill/pull/597#discussion_r80968936

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/ExecConstants.java ---
    @@ -125,13 +125,14 @@
       String PARQUET_WRITER_ENABLE_DICTIONARY_ENCODING = "store.parquet.enable_dictionary_encoding";
       OptionValidator PARQUET_WRITER_ENABLE_DICTIONARY_ENCODING_VALIDATOR = new BooleanValidator(
           PARQUET_WRITER_ENABLE_DICTIONARY_ENCODING, false);
    -
       String PARQUET_VECTOR_FILL_THRESHOLD = "store.parquet.vector_fill_threshold";
       OptionValidator PARQUET_VECTOR_FILL_THRESHOLD_VALIDATOR = new PositiveLongValidator(PARQUET_VECTOR_FILL_THRESHOLD, 99l, 85l);
       String PARQUET_VECTOR_FILL_CHECK_THRESHOLD = "store.parquet.vector_fill_check_threshold";
       OptionValidator PARQUET_VECTOR_FILL_CHECK_THRESHOLD_VALIDATOR = new PositiveLongValidator(PARQUET_VECTOR_FILL_CHECK_THRESHOLD, 100l, 10l);
       String PARQUET_NEW_RECORD_READER = "store.parquet.use_new_reader";
       OptionValidator PARQUET_RECORD_READER_IMPLEMENTATION_VALIDATOR = new BooleanValidator(PARQUET_NEW_RECORD_READER, false);
    +  String PARQUET_RECORD_BATCH_SIZE = "store.parquet.record_batch_size";
    --- End diff --

    It would be nice to use "store.parquet.reader.record_batch_size", since this option is for the reader only, not for the parquet writer. We cannot change the option name once it's released.
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15530256#comment-15530256 ]

ASF GitHub Bot commented on DRILL-4905:
---------------------------------------

Github user jinfengni commented on a diff in the pull request:

    https://github.com/apache/drill/pull/597#discussion_r80968605

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/ExecConstants.java ---
    @@ -125,13 +125,14 @@
       String PARQUET_WRITER_ENABLE_DICTIONARY_ENCODING = "store.parquet.enable_dictionary_encoding";
       OptionValidator PARQUET_WRITER_ENABLE_DICTIONARY_ENCODING_VALIDATOR = new BooleanValidator(
           PARQUET_WRITER_ENABLE_DICTIONARY_ENCODING, false);
    -
       String PARQUET_VECTOR_FILL_THRESHOLD = "store.parquet.vector_fill_threshold";
       OptionValidator PARQUET_VECTOR_FILL_THRESHOLD_VALIDATOR = new PositiveLongValidator(PARQUET_VECTOR_FILL_THRESHOLD, 99l, 85l);
       String PARQUET_VECTOR_FILL_CHECK_THRESHOLD = "store.parquet.vector_fill_check_threshold";
       OptionValidator PARQUET_VECTOR_FILL_CHECK_THRESHOLD_VALIDATOR = new PositiveLongValidator(PARQUET_VECTOR_FILL_CHECK_THRESHOLD, 100l, 10l);
       String PARQUET_NEW_RECORD_READER = "store.parquet.use_new_reader";
       OptionValidator PARQUET_RECORD_READER_IMPLEMENTATION_VALIDATOR = new BooleanValidator(PARQUET_NEW_RECORD_READER, false);
    +  String PARQUET_RECORD_BATCH_SIZE = "store.parquet.record_batch_size";
    +  OptionValidator PARQUET_RECORD_BATCH_SIZE_VALIDATOR = new LongValidator(PARQUET_RECORD_BATCH_SIZE, 256*1024);
    --- End diff --

    If you want to specify the maximum allowed batch_size, you probably have to use PositiveLongValidator, which will allow you to specify a max.
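Following the PositiveLongValidator(name, max, default) pattern already visible in the hunk above, the suggested declaration could look like this sketch; the default value and the renamed option are assumptions here:

{code}
String PARQUET_RECORD_BATCH_SIZE = "store.parquet.reader.record_batch_size";
// max = 256K rows, default = 256K rows (default chosen for illustration only)
OptionValidator PARQUET_RECORD_BATCH_SIZE_VALIDATOR =
    new PositiveLongValidator(PARQUET_RECORD_BATCH_SIZE, 256 * 1024, 256 * 1024);
{code}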
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15527763#comment-15527763 ]

ASF GitHub Bot commented on DRILL-4905:
---------------------------------------

Github user ppadma commented on a diff in the pull request:

    https://github.com/apache/drill/pull/597#discussion_r80817027

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java ---
    @@ -115,6 +115,8 @@
       private List<RowGroupInfo> rowGroupInfos;
       private Metadata.ParquetTableMetadataBase parquetTableMetadata = null;
       private String cacheFileRoot = null;
    +  private int batchSize;
    +  private static final int DEFAULT_BATCH_LENGTH = 256 * 1024;
    --- End diff --

    The max value for store.parquet.record_batch_size is 256K, so it cannot be set to 512K. I changed the name in ParquetGroupScan/ParquetRowGroupScan to recommendedBatchSize, as we discussed. Please review the new diffs.
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15527425#comment-15527425 ]

ASF GitHub Bot commented on DRILL-4905:
---------------------------------------

Github user jinfengni commented on a diff in the pull request:

    https://github.com/apache/drill/pull/597#discussion_r80796011

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java ---
    @@ -115,6 +115,8 @@
       private List<RowGroupInfo> rowGroupInfos;
       private Metadata.ParquetTableMetadataBase parquetTableMetadata = null;
       private String cacheFileRoot = null;
    +  private int batchSize;
    +  private static final int DEFAULT_BATCH_LENGTH = 256 * 1024;
    --- End diff --

    Are you referring to the code here:

    {code}
    // Pick the minimum of recordsPerBatch calculated above, batchSize we got from rowGroupScan (based on limit)
    // and user configured batchSize value.
    recordsPerBatch = (int) Math.min(Math.min(recordsPerBatch, batchSize),
        fragmentContext.getOptions().getOption(ExecConstants.PARQUET_RECORD_BATCH_SIZE).num_val.intValue());
    {code}

    If I understand correctly, batchSize in ParquetRecordReader comes from ParquetRowGroupScan, which comes from ParquetGroupScan, which is set to DEFAULT_BATCH_LENGTH. If I have a row group with 512K rows and I set "store.parquet.record_batch_size" to 512K, will your code honor this 512K batch size, or will it use DEFAULT_BATCH_LENGTH, since it's the smallest?

    Also, if "store.parquet.record_batch_size" is set to something different from DEFAULT_BATCH_LENGTH, why would we still use DEFAULT_BATCH_LENGTH in ParquetGroupScan / ParquetRowGroupScan? People might be confused if they look at the serialized physical plan, which shows batchSize = DEFAULT_BATCH_LENGTH.
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15527347#comment-15527347 ]

ASF GitHub Bot commented on DRILL-4905:
---------------------------------------

Github user ppadma commented on a diff in the pull request:

    https://github.com/apache/drill/pull/597#discussion_r80788617

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java ---
    @@ -899,6 +907,16 @@ public FileGroupScan clone(FileSelection selection) throws IOException {
         return newScan;
       }

    +  // clone to create new groupscan with new file selection and batchSize.
    +  public ParquetGroupScan clone(FileSelection selection, int batchSize) throws IOException {
    +    ParquetGroupScan newScan = new ParquetGroupScan(this);
    +    newScan.modifyFileSelection(selection);
    +    newScan.cacheFileRoot = selection.cacheFileRoot;
    +    newScan.init(selection.getMetaContext());
    +    newScan.batchSize = batchSize;
    --- End diff --

    done
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15527346#comment-15527346 ]

ASF GitHub Bot commented on DRILL-4905:
---------------------------------------

Github user ppadma commented on a diff in the pull request:

    https://github.com/apache/drill/pull/597#discussion_r80788614

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java ---
    @@ -899,6 +907,16 @@ public FileGroupScan clone(FileSelection selection) throws IOException {
         return newScan;
       }

    +  // clone to create new groupscan with new file selection and batchSize.
    +  public ParquetGroupScan clone(FileSelection selection, int batchSize) throws IOException {
    +    ParquetGroupScan newScan = new ParquetGroupScan(this);
    +    newScan.modifyFileSelection(selection);
    +    newScan.cacheFileRoot = selection.cacheFileRoot;
    --- End diff --

    done
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15527343#comment-15527343 ]

ASF GitHub Bot commented on DRILL-4905:
---------------------------------------

Github user ppadma commented on a diff in the pull request:

    https://github.com/apache/drill/pull/597#discussion_r80788478

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetScanBatchCreator.java ---
    @@ -107,7 +107,7 @@ public ScanBatch getBatch(FragmentContext context, ParquetRowGroupScan rowGroupScan,
           if (!context.getOptions().getOption(ExecConstants.PARQUET_NEW_RECORD_READER).bool_val && !isComplex(footers.get(e.getPath()))) {
             readers.add(
                 new ParquetRecordReader(
    -                context, e.getPath(), e.getRowGroupIndex(), fs,
    +                context, rowGroupScan.getBatchSize(), e.getPath(), e.getRowGroupIndex(), fs,
    --- End diff --

    done.
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15527340#comment-15527340 ]

ASF GitHub Bot commented on DRILL-4905:
---------------------------------------

Github user ppadma commented on the issue:

    https://github.com/apache/drill/pull/597

    Updated with review comments.
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15527288#comment-15527288 ]

Padma Penumarthy commented on DRILL-4905:
-

Doing this fix for the native parquet reader only.
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15526958#comment-15526958 ]

ASF GitHub Bot commented on DRILL-4905:
---

Github user ppadma commented on a diff in the pull request:
https://github.com/apache/drill/pull/597#discussion_r80756694

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java ---
@@ -926,16 +944,22 @@ public GroupScan applyLimit(long maxRecords) {
       fileNames.add(rowGroupInfo.getPath());
     }

-    if (fileNames.size() == fileSet.size() ) {
+    // If there is no change in fileSet and maxRecords is >= batchSize, no need to create new groupScan.
+    if (fileNames.size() == fileSet.size() && (maxRecords >= batchSize) ) {
       // There is no reduction of rowGroups. Return the original groupScan.
       logger.debug("applyLimit() does not apply!");
       return null;
     }

+    // If limit maxRecords is less than batchSize, update batchSize to the limit size.
+    if (maxRecords < batchSize) {
+      batchSize = (int) maxRecords;
--- End diff --

I tried to avoid this cast. Fixing it properly would require changes in a bunch of places, and I was trying to keep the code changes minimal so that no new regressions are introduced.
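The decision under discussion, as a simplified sketch; ScanConfig and fileSetUnchanged are illustrative stand-ins for ParquetGroupScan's internals, not Drill's actual API:

    final class ScanConfig {
      static final int DEFAULT_BATCH_SIZE = 256 * 1024;
      private int batchSize = DEFAULT_BATCH_SIZE;

      // Returns null when the limit changes nothing (no file pruning and the
      // limit covers at least one full batch); otherwise returns a config
      // whose batch size is clamped down to the limit.
      ScanConfig applyLimit(long maxRecords, boolean fileSetUnchanged) {
        if (fileSetUnchanged && maxRecords >= batchSize) {
          return null; // keep the original scan
        }
        ScanConfig clamped = new ScanConfig();
        // The narrowing cast is safe: the min is bounded by batchSize, an int.
        clamped.batchSize = (int) Math.min(maxRecords, batchSize);
        return clamped;
      }
    }

Clamping via Math.min also sidesteps the raw (int) maxRecords cast questioned below, since the result can never exceed an int-valued field.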
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15526939#comment-15526939 ]

ASF GitHub Bot commented on DRILL-4905:
---

Github user jinfengni commented on the issue:
https://github.com/apache/drill/pull/597

comment
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15526935#comment-15526935 ]

ASF GitHub Bot commented on DRILL-4905:
---

Github user jinfengni commented on a diff in the pull request:
https://github.com/apache/drill/pull/597#discussion_r80754698

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetScanBatchCreator.java ---
@@ -107,7 +107,7 @@ public ScanBatch getBatch(FragmentContext context, ParquetRowGroupScan rowGroupS
     if (!context.getOptions().getOption(ExecConstants.PARQUET_NEW_RECORD_READER).bool_val && !isComplex(footers.get(e.getPath()))) {
       readers.add(
         new ParquetRecordReader(
-          context, e.getPath(), e.getRowGroupIndex(), fs,
+          context, rowGroupScan.getBatchSize(), e.getPath(), e.getRowGroupIndex(), fs,
--- End diff --

If it's only for one type of parquet reader, please document that in the JIRA so that people will know.
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15526932#comment-15526932 ]

ASF GitHub Bot commented on DRILL-4905:
---

Github user jinfengni commented on a diff in the pull request:
https://github.com/apache/drill/pull/597#discussion_r80754417

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java ---
@@ -115,6 +115,8 @@
   private List<RowGroupInfo> rowGroupInfos;
   private Metadata.ParquetTableMetadataBase parquetTableMetadata = null;
   private String cacheFileRoot = null;
+  private int batchSize;
+  private static final int DEFAULT_BATCH_LENGTH = 256 * 1024;
--- End diff --

If it has nothing to do with the new option, why do they have the same value of 256 * 1024? Just a coincidence? Also, since a new option is introduced: if a user alters this option and sets it to a different value than the default, will the option be honored in the parquet reader?
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15526924#comment-15526924 ]

ASF GitHub Bot commented on DRILL-4905:
---

Github user ppadma commented on a diff in the pull request:
https://github.com/apache/drill/pull/597#discussion_r80754060

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java ---
@@ -899,6 +907,16 @@ public FileGroupScan clone(FileSelection selection) throws IOException {
     return newScan;
   }

+  // clone to create new groupscan with new file selection and batchSize.
+  public ParquetGroupScan clone(FileSelection selection, int batchSize) throws IOException {
+    ParquetGroupScan newScan = new ParquetGroupScan(this);
+    newScan.modifyFileSelection(selection);
+    newScan.cacheFileRoot = selection.cacheFileRoot;
--- End diff --

yes, we can. will do.
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15526883#comment-15526883 ]

ASF GitHub Bot commented on DRILL-4905:
---

Github user ppadma commented on a diff in the pull request:
https://github.com/apache/drill/pull/597#discussion_r80751210

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetScanBatchCreator.java ---
@@ -107,7 +107,7 @@ public ScanBatch getBatch(FragmentContext context, ParquetRowGroupScan rowGroupS
     if (!context.getOptions().getOption(ExecConstants.PARQUET_NEW_RECORD_READER).bool_val && !isComplex(footers.get(e.getPath()))) {
       readers.add(
         new ParquetRecordReader(
-          context, e.getPath(), e.getRowGroupIndex(), fs,
+          context, rowGroupScan.getBatchSize(), e.getPath(), e.getRowGroupIndex(), fs,
--- End diff --

This fix is done only for the native parquet reader.
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15526881#comment-15526881 ]

ASF GitHub Bot commented on DRILL-4905:
---

Github user ppadma commented on a diff in the pull request:
https://github.com/apache/drill/pull/597#discussion_r80751189

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java ---
@@ -115,6 +115,8 @@
   private List<RowGroupInfo> rowGroupInfos;
   private Metadata.ParquetTableMetadataBase parquetTableMetadata = null;
   private String cacheFileRoot = null;
+  private int batchSize;
+  private static final int DEFAULT_BATCH_LENGTH = 256 * 1024;
--- End diff --

The default batch length is used here to compare against the limit value and decide whether we need to create a new group scan. The new option has nothing to do with this; in the normal case we do not touch or use the option. It is added so that we can change the batch size at run time if we ever need to.
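The pattern described here, a compile-time default with an optional runtime override, in a hedged sketch (the Map-based option lookup is illustrative only; Drill's OptionManager works differently):

    import java.util.Map;

    final class BatchSizeResolver {
      static final int DEFAULT_BATCH_LENGTH = 256 * 1024;

      // Falls back to the default unless the session set an override.
      static int resolve(Map<String, Long> sessionOptions) {
        Long override = sessionOptions.get("store.parquet.record_batch_size");
        return override == null ? DEFAULT_BATCH_LENGTH : Math.toIntExact(override);
      }
    }

One consistent reading of the thread is that the constant and the option express the same default batch length by two paths: one baked in for the planner's applyLimit() comparison, one adjustable per session.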
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15524922#comment-15524922 ]

ASF GitHub Bot commented on DRILL-4905:
---

Github user gparai commented on a diff in the pull request:
https://github.com/apache/drill/pull/597#discussion_r80612967

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java ---
@@ -899,6 +907,16 @@ public FileGroupScan clone(FileSelection selection) throws IOException {
     return newScan;
   }

+  // clone to create new groupscan with new file selection and batchSize.
+  public ParquetGroupScan clone(FileSelection selection, int batchSize) throws IOException {
+    ParquetGroupScan newScan = new ParquetGroupScan(this);
+    newScan.modifyFileSelection(selection);
+    newScan.cacheFileRoot = selection.cacheFileRoot;
+    newScan.init(selection.getMetaContext());
+    newScan.batchSize = batchSize;
--- End diff --

Should we have a setter method for this? Or a builder pattern?
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15524920#comment-15524920 ]

ASF GitHub Bot commented on DRILL-4905:
---

Github user gparai commented on a diff in the pull request:
https://github.com/apache/drill/pull/597#discussion_r80612914

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java ---
@@ -899,6 +907,16 @@ public FileGroupScan clone(FileSelection selection) throws IOException {
     return newScan;
   }

+  // clone to create new groupscan with new file selection and batchSize.
+  public ParquetGroupScan clone(FileSelection selection, int batchSize) throws IOException {
+    ParquetGroupScan newScan = new ParquetGroupScan(this);
+    newScan.modifyFileSelection(selection);
+    newScan.cacheFileRoot = selection.cacheFileRoot;
--- End diff --

Should we have a setter method for this?
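One shape the builder suggestion could take; a sketch with hypothetical names, not the PR's actual code:

    // Hypothetical immutable scan settings produced by a builder.
    final class ScanSettings {
      final String cacheFileRoot;
      final int batchSize;

      private ScanSettings(Builder b) {
        this.cacheFileRoot = b.cacheFileRoot;
        this.batchSize = b.batchSize;
      }

      static final class Builder {
        private String cacheFileRoot;
        private int batchSize = 256 * 1024;

        Builder cacheFileRoot(String root) { this.cacheFileRoot = root; return this; }
        Builder batchSize(int n) { this.batchSize = n; return this; }
        ScanSettings build() { return new ScanSettings(this); }
      }
    }

Usage would read new ScanSettings.Builder().cacheFileRoot(selection.cacheFileRoot).batchSize(batchSize).build(), keeping direct field assignments like those in clone() behind a fluent API so new clone variants don't multiply overloads.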
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15524921#comment-15524921 ]

ASF GitHub Bot commented on DRILL-4905:
---

Github user gparai commented on a diff in the pull request:
https://github.com/apache/drill/pull/597#discussion_r80613372

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java ---
@@ -926,16 +944,22 @@ public GroupScan applyLimit(long maxRecords) {
       fileNames.add(rowGroupInfo.getPath());
     }

-    if (fileNames.size() == fileSet.size() ) {
+    // If there is no change in fileSet and maxRecords is >= batchSize, no need to create new groupScan.
+    if (fileNames.size() == fileSet.size() && (maxRecords >= batchSize) ) {
       // There is no reduction of rowGroups. Return the original groupScan.
       logger.debug("applyLimit() does not apply!");
       return null;
     }

+    // If limit maxRecords is less than batchSize, update batchSize to the limit size.
+    if (maxRecords < batchSize) {
+      batchSize = (int) maxRecords;
--- End diff --

Can we avoid the cast?
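Two standard ways to answer "can we avoid the cast?" in Java, as a sketch:

    final class Casts {
      // Clamp: the min is bounded by batchSize (an int), so narrowing cannot overflow.
      static int clampToBatchSize(long maxRecords, int batchSize) {
        return (int) Math.min(maxRecords, batchSize);
      }

      // Strict: throws ArithmeticException on overflow instead of truncating silently.
      static int strictNarrow(long maxRecords) {
        return Math.toIntExact(maxRecords);
      }
    }

In this particular branch the guard maxRecords < batchSize already bounds the value, so the existing cast is safe; the alternatives above just make that invariant explicit in code.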
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15524755#comment-15524755 ]

ASF GitHub Bot commented on DRILL-4905:
---

Github user jinfengni commented on a diff in the pull request:
https://github.com/apache/drill/pull/597#discussion_r80608344

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetScanBatchCreator.java ---
@@ -107,7 +107,7 @@ public ScanBatch getBatch(FragmentContext context, ParquetRowGroupScan rowGroupS
     if (!context.getOptions().getOption(ExecConstants.PARQUET_NEW_RECORD_READER).bool_val && !isComplex(footers.get(e.getPath()))) {
       readers.add(
         new ParquetRecordReader(
-          context, e.getPath(), e.getRowGroupIndex(), fs,
+          context, rowGroupScan.getBatchSize(), e.getPath(), e.getRowGroupIndex(), fs,
--- End diff --

What about DrillParquetReader?
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15524754#comment-15524754 ]

ASF GitHub Bot commented on DRILL-4905:
---

Github user jinfengni commented on a diff in the pull request:
https://github.com/apache/drill/pull/597#discussion_r80606955

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java ---
@@ -115,6 +115,8 @@
   private List<RowGroupInfo> rowGroupInfos;
   private Metadata.ParquetTableMetadataBase parquetTableMetadata = null;
   private String cacheFileRoot = null;
+  private int batchSize;
+  private static final int DEFAULT_BATCH_LENGTH = 256 * 1024;
--- End diff --

Why do we have this DEFAULT_BATCH_LENGTH in addition to the new option "store.parquet.record_batch_size"?
[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read
[ https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15524649#comment-15524649 ]

ASF GitHub Bot commented on DRILL-4905:
---

GitHub user ppadma opened a pull request:
https://github.com/apache/drill/pull/597

DRILL-4905: Push down the LIMIT to the parquet reader scan.

For a limit N query, where N is less than the current default record batch size (256K when all columns are fixed-length, 32K otherwise), we still end up reading all 256K/32K rows from disk if the row group has that many rows. This causes performance degradation, especially when there is a large number of columns. This fix addresses the problem by changing the record batch size the parquet record reader uses, so that we don't read more than what is needed. It also adds a sys option (store.parquet.record_batch_size) so the record batch size can be set at run time.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ppadma/drill DRILL-4905

Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/drill/pull/597.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #597

commit cd665ebdba11f8685ba446f5ec535c81ddd6edc7
Author: Padma Penumarthy
Date: 2016-09-26T17:51:07Z

    DRILL-4905: Push down the LIMIT to the parquet reader scan to limit the numbers of records read
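The net effect of the change, as a tiny model (names are illustrative, not Drill's API):

    final class LimitPushdown {
      static final int DEFAULT_BATCH_SIZE = 256 * 1024;

      // Rows the reader materializes for its first batch under a LIMIT.
      static long firstBatchRows(long limit, long rowsInRowGroup) {
        long batch = Math.min(DEFAULT_BATCH_SIZE, limit); // pushdown clamps the batch
        return Math.min(batch, rowsInRowGroup);           // never read past the row group
      }
    }

For example, firstBatchRows(1, 1_000_000) is 1 after the fix, where the old behavior read a full 256K-row (or 32K-row) batch from disk before the LIMIT operator discarded the excess.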