[GitHub] drill pull request #749: DRILL-5266: Parquet returns low-density batches

2017-02-17 Thread paul-rogers
GitHub user paul-rogers opened a pull request:

https://github.com/apache/drill/pull/749

DRILL-5266: Parquet returns low-density batches

Fixes one glaring problem related to bit/byte confusion.

Includes a few clean-up items found along the way.
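
For context, the bit/byte confusion in question amounts to dividing a byte-denominated batch budget by a bit-denominated row width (or the reverse), which miscounts records per batch by a factor of 8. A minimal standalone sketch of that failure mode, using made-up constants rather than the literal values from the patch:

    // Illustration only: how mixing bits and bytes under-fills a batch.
    public class BitByteSketch {
      public static void main(String[] args) {
        long batchSizeBytes = 256 * 1024;  // assumed byte budget per batch
        int bitWidthAllFixedFields = 256;  // e.g. four 8-byte columns, in bits

        // Buggy form: a byte budget divided by a bit width yields 8x too
        // few records per batch, hence the "low-density" symptom.
        long lowDensity = batchSizeBytes / bitWidthAllFixedFields;        // 1024

        // Fixed form: convert the row width to bytes before dividing.
        long fullDensity = batchSizeBytes / (bitWidthAllFixedFields / 8); // 8192

        System.out.println(lowDensity + " vs " + fullDensity);
      }
    }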

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/paul-rogers/drill DRILL-5266

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/749.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #749








[GitHub] drill pull request #749: DRILL-5266: Parquet returns low-density batches

2017-02-23 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/749#discussion_r102725896
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/ParquetRecordReader.java ---
@@ -376,14 +378,14 @@ public void setup(OperatorContext operatorContext, OutputMutator output) throws
       if (dataTypeLength == -1) {
         allFieldsFixedLength = false;
       } else {
-      bitWidthAllFixedFields += dataTypeLength;
+        bitWidthAllFixedFields += dataTypeLength;
       }
     }
 //    rowGroupOffset = footer.getBlocks().get(rowGroupIndex).getColumns().get(0).getFirstDataPageOffset();
 
     if (columnsToScan != 0 && allFieldsFixedLength) {
       recordsPerBatch = (int) Math.min(Math.min(batchSize / bitWidthAllFixedFields,
-          footer.getBlocks().get(0).getColumns().get(0).getValueCount()), 65535);
+          footer.getBlocks().get(0).getColumns().get(0).getValueCount()), DEFAULT_RECORDS_TO_READ_IF_VARIABLE_WIDTH);
--- End diff --

This should be DEFAULT_RECORDS_TO_READ_FIXED_WIDTH (not VARIABLE).




[GitHub] drill pull request #749: DRILL-5266: Parquet returns low-density batches

2017-02-23 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/749#discussion_r102726321
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/ParquetRecordReader.java ---
@@ -73,6 +72,7 @@
   private static final long DEFAULT_BATCH_LENGTH = 256 * 1024 * NUMBER_OF_VECTORS; // 256kb
   private static final long DEFAULT_BATCH_LENGTH_IN_BITS = DEFAULT_BATCH_LENGTH * 8; // 256kb
   private static final char DEFAULT_RECORDS_TO_READ_IF_NOT_FIXED_WIDTH = 32*1024;
--- End diff --

While we are at it, I would probably rename DEFAULT_RECORDS_TO_READ_IF_NOT_FIXED_WIDTH to DEFAULT_RECORDS_TO_READ_VARIABLE_WIDTH.




[GitHub] drill pull request #749: DRILL-5266: Parquet returns low-density batches

2017-02-23 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/749#discussion_r102726705
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/VarLenBinaryReader.java ---
@@ -70,33 +70,31 @@ public long readFields(long recordsToReadInThisPass, ColumnReader firstColumn
     return recordsReadInCurrentPass;
   }
 
-
   private long determineSizesSerial(long recordsToReadInThisPass) throws IOException {
-    int lengthVarFieldsInCurrentRecord = 0;
-    boolean exitLengthDeterminingLoop = false;
-    long totalVariableLengthData = 0;
-    long recordsReadInCurrentPass = 0;
-    do {
+
+    // Can't read any more records than fixed width fields will fit.
+    // Note: this calculation is very likely wrong; it is a simplified
+    // version of earlier code, but probably needs even more attention.
+
+    int totalFixedFieldWidth = parentReader.getBitWidthAllFixedFields() / 8;
+    long batchSize = parentReader.getBatchSize();
+    if (totalFixedFieldWidth > 0) {
+      recordsToReadInThisPass = Math.min(recordsToReadInThisPass, batchSize / totalFixedFieldWidth);
--- End diff --

Instead of fixing it up here, please do it outside the function and pass 
the correct value for recordsToReadInThisPass.
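
A standalone sketch of the suggested shape, with hypothetical stand-ins (ClampSketch, the constants, and clampToFixedWidthBudget are illustrative, not Drill's actual API):

    // Hypothetical sketch: compute the fixed-width clamp outside
    // determineSizesSerial() and pass the corrected count in.
    public class ClampSketch {
      // Stand-ins for parentReader.getBitWidthAllFixedFields() and getBatchSize().
      static final int BIT_WIDTH_ALL_FIXED_FIELDS = 256;
      static final long BATCH_SIZE = 256 * 1024;

      // Apply the fixed-width budget once, before the sizing pass runs.
      static long clampToFixedWidthBudget(long recordsToRead) {
        int totalFixedFieldWidth = BIT_WIDTH_ALL_FIXED_FIELDS / 8;
        if (totalFixedFieldWidth > 0) {
          recordsToRead = Math.min(recordsToRead, BATCH_SIZE / totalFixedFieldWidth);
        }
        return recordsToRead;
      }

      public static void main(String[] args) {
        // A caller such as readFields() would then do:
        //   determineSizesSerial(clampToFixedWidthBudget(recordsToReadInThisPass));
        System.out.println(clampToFixedWidthBudget(65536)); // prints 8192
      }
    }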




[GitHub] drill pull request #749: DRILL-5266: Parquet returns low-density batches

2017-02-23 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/749#discussion_r102809881
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/ParquetRecordReader.java ---
@@ -73,6 +72,7 @@
   private static final long DEFAULT_BATCH_LENGTH = 256 * 1024 * NUMBER_OF_VECTORS; // 256kb
   private static final long DEFAULT_BATCH_LENGTH_IN_BITS = DEFAULT_BATCH_LENGTH * 8; // 256kb
   private static final char DEFAULT_RECORDS_TO_READ_IF_NOT_FIXED_WIDTH = 32*1024;
--- End diff --

Fixed.




[GitHub] drill pull request #749: DRILL-5266: Parquet returns low-density batches

2017-02-23 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/749#discussion_r102809909
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/ParquetRecordReader.java ---
@@ -376,14 +378,14 @@ public void setup(OperatorContext operatorContext, OutputMutator output) throws
       if (dataTypeLength == -1) {
         allFieldsFixedLength = false;
       } else {
-      bitWidthAllFixedFields += dataTypeLength;
+        bitWidthAllFixedFields += dataTypeLength;
       }
     }
 //    rowGroupOffset = footer.getBlocks().get(rowGroupIndex).getColumns().get(0).getFirstDataPageOffset();
 
     if (columnsToScan != 0 && allFieldsFixedLength) {
       recordsPerBatch = (int) Math.min(Math.min(batchSize / bitWidthAllFixedFields,
-          footer.getBlocks().get(0).getColumns().get(0).getValueCount()), 65535);
+          footer.getBlocks().get(0).getColumns().get(0).getValueCount()), DEFAULT_RECORDS_TO_READ_IF_VARIABLE_WIDTH);
--- End diff --

Fixed.




[GitHub] drill pull request #749: DRILL-5266: Parquet returns low-density batches

2017-02-23 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/749#discussion_r102812251
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/VarLenBinaryReader.java ---
@@ -70,33 +70,31 @@ public long readFields(long recordsToReadInThisPass, ColumnReader firstColumn
     return recordsReadInCurrentPass;
   }
 
-
   private long determineSizesSerial(long recordsToReadInThisPass) throws IOException {
-    int lengthVarFieldsInCurrentRecord = 0;
-    boolean exitLengthDeterminingLoop = false;
-    long totalVariableLengthData = 0;
-    long recordsReadInCurrentPass = 0;
-    do {
+
+    // Can't read any more records than fixed width fields will fit.
+    // Note: this calculation is very likely wrong; it is a simplified
+    // version of earlier code, but probably needs even more attention.
+
+    int totalFixedFieldWidth = parentReader.getBitWidthAllFixedFields() / 8;
+    long batchSize = parentReader.getBatchSize();
+    if (totalFixedFieldWidth > 0) {
+      recordsToReadInThisPass = Math.min(recordsToReadInThisPass, batchSize / totalFixedFieldWidth);
--- End diff --

Fixed. Moved to the constructor. Changed the batch size in the reader to
final to guarantee that it can be treated as an invariant.

Please give the revised code a good inspection.

As noted in the comment in the code, this area is a target-rich
environment for improvements...




[GitHub] drill pull request #749: DRILL-5266: Parquet returns low-density batches

2017-02-23 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/749#discussion_r102826142
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/VarLenBinaryReader.java ---
@@ -33,35 +33,52 @@
   ParquetRecordReader parentReader;
   final List> columns;
   final boolean useAsyncTasks;
+  private final long targetRecordCount;
 
   public VarLenBinaryReader(ParquetRecordReader parentReader, List> columns) {
     this.parentReader = parentReader;
     this.columns = columns;
     useAsyncTasks = parentReader.useAsyncColReader;
+
+    // Can't read any more records than fixed width fields will fit.
+    // Note: this calculation is very likely wrong; it is a simplified
+    // version of earlier code, but probably needs even more attention.
+
+    int totalFixedFieldWidth = parentReader.getBitWidthAllFixedFields() / 8;
+    if (totalFixedFieldWidth == 0) {
+      targetRecordCount = 0;
+    } else {
+      targetRecordCount = parentReader.getBatchSize() / totalFixedFieldWidth;
+    }
   }
 
   /**
    * Reads as many variable length values as possible.
    *
    * @param recordsToReadInThisPass - the number of records recommended for reading from the reader
-   * @param firstColumnStatus - a reference to the first column status in the parquet file to grab metadata from
+   * @param firstColumnStatus - a reference to the first column status in the Parquet file to grab metadata from
    * @return - the number of fixed length fields that will fit in the batch
    * @throws IOException
    */
   public long readFields(long recordsToReadInThisPass, ColumnReader firstColumnStatus) throws IOException {
--- End diff --

Since we are not using firstColumnStatus, can we just remove it?




[GitHub] drill pull request #749: DRILL-5266: Parquet returns low-density batches

2017-02-23 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/749#discussion_r102824318
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/ParquetRecordReader.java ---
@@ -351,7 +353,7 @@ public void setup(OperatorContext operatorContext, OutputMutator output) throws
 
     MaterializedField field;
 //    ParquetMetadataConverter metaConverter = new ParquetMetadataConverter();
-    FileMetaData fileMetaData;
+//    FileMetaData fileMetaData;
--- End diff --

Instead of commenting out, just remove this line and the earlier one.




[GitHub] drill pull request #749: DRILL-5266: Parquet returns low-density batches

2017-02-23 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/749#discussion_r102830312
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/VarLenBinaryReader.java ---
@@ -33,35 +33,52 @@
   ParquetRecordReader parentReader;
   final List> columns;
   final boolean useAsyncTasks;
+  private final long targetRecordCount;
 
   public VarLenBinaryReader(ParquetRecordReader parentReader, List> columns) {
     this.parentReader = parentReader;
     this.columns = columns;
     useAsyncTasks = parentReader.useAsyncColReader;
+
+    // Can't read any more records than fixed width fields will fit.
+    // Note: this calculation is very likely wrong; it is a simplified
+    // version of earlier code, but probably needs even more attention.
+
+    int totalFixedFieldWidth = parentReader.getBitWidthAllFixedFields() / 8;
+    if (totalFixedFieldWidth == 0) {
+      targetRecordCount = 0;
+    } else {
+      targetRecordCount = parentReader.getBatchSize() / totalFixedFieldWidth;
+    }
   }
 
   /**
    * Reads as many variable length values as possible.
    *
    * @param recordsToReadInThisPass - the number of records recommended for reading from the reader
-   * @param firstColumnStatus - a reference to the first column status in the parquet file to grab metadata from
+   * @param firstColumnStatus - a reference to the first column status in the Parquet file to grab metadata from
    * @return - the number of fixed length fields that will fit in the batch
    * @throws IOException
    */
   public long readFields(long recordsToReadInThisPass, ColumnReader firstColumnStatus) throws IOException {
 
-    long recordsReadInCurrentPass = 0;
-
     // write the first 0 offset
     for (VarLengthColumn columnReader : columns) {
       columnReader.reset();
     }
     Stopwatch timer = Stopwatch.createStarted();
 
-    recordsReadInCurrentPass = determineSizesSerial(recordsToReadInThisPass);
-    if(useAsyncTasks){
+    long recordsReadInCurrentPass = determineSizesSerial(recordsToReadInThisPass);
+
+    // Can't read any more records than fixed width fields will fit.
+
+    if (targetRecordCount > 0) {
+      recordsToReadInThisPass = Math.min(recordsToReadInThisPass, targetRecordCount);
--- End diff --

I think you mean to update recordsReadInCurrentPass. recordsToReadInThisPass is not
used after this point, so what is the point of updating it?
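
Put differently, the clamp only has an effect if it runs before determineSizesSerial() consumes the value. A standalone sketch of that ordering, with stubbed names (ClampOrderSketch and the stub are illustrative, not Drill's real code):

    // Hypothetical sketch: clamp BEFORE the sizing pass. Clamping afterwards
    // mutates a value that is never read again, which is the point above.
    public class ClampOrderSketch {
      static final long TARGET_RECORD_COUNT = 8192; // stand-in for targetRecordCount

      // Stub: pretend every requested record fits in the batch.
      static long determineSizesSerial(long recordsToRead) {
        return recordsToRead;
      }

      static long readFields(long recordsToReadInThisPass) {
        if (TARGET_RECORD_COUNT > 0) {
          recordsToReadInThisPass = Math.min(recordsToReadInThisPass, TARGET_RECORD_COUNT);
        }
        return determineSizesSerial(recordsToReadInThisPass);
      }

      public static void main(String[] args) {
        System.out.println(readFields(65536)); // prints 8192, not 65536
      }
    }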




[GitHub] drill pull request #749: DRILL-5266: Parquet returns low-density batches

2017-02-23 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/749#discussion_r102827651
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/VarLenBinaryReader.java ---
@@ -70,33 +87,21 @@ public long readFields(long recordsToReadInThisPass, ColumnReader firstColumn
     return recordsReadInCurrentPass;
   }
 
-
   private long determineSizesSerial(long recordsToReadInThisPass) throws IOException {
-    int lengthVarFieldsInCurrentRecord = 0;
-    boolean exitLengthDeterminingLoop = false;
-    long totalVariableLengthData = 0;
-    long recordsReadInCurrentPass = 0;
-    do {
+
+    int recordsReadInCurrentPass = 0;
+    top: do {
       for (VarLengthColumn columnReader : columns) {
-        if (!exitLengthDeterminingLoop) {
-          exitLengthDeterminingLoop =
-              columnReader.determineSize(recordsReadInCurrentPass, lengthVarFieldsInCurrentRecord);
-        } else {
-          break;
+        // Return status is "done reading", meaning stop if true.
+        if (columnReader.determineSize(recordsReadInCurrentPass, 0 /* unused */)) {
--- End diff --

Why not remove the unused parameter?




[GitHub] drill pull request #749: DRILL-5266: Parquet returns low-density batches

2017-03-02 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/drill/pull/749

