sachouche opened a new pull request #1630: DRILL-7018: Fixed Parquet buffer overflow when reading timestamp column
URL: https://github.com/apache/drill/pull/1630

**Problem Description**
- Parquet advertises timestamps with a 12-byte precision (INT96)
- By default, such entries are mapped to a Drill VARBINARY data type
- When the "store.parquet.reader.int96_as_timestamp" option is set, the Parquet reader maps this data type to a Drill TIMESTAMP, which has less precision (8 bytes; the nanosecond part is lost)
- The previous Drill code (before the batch sizing feature) used to pre-allocate buffers with room for at least 4096 values
- The Parquet file in question has a timestamp column with 31 null values
- The code attempts to advance the write offset by 31 * 12 bytes (the original column precision); the correct offset should be based on the Drill precision, as is already done for non-null values (see the arithmetic sketch after this description)
- The old Drill version worked only because it had plenty of pre-allocated space
- The batch sizing feature is smarter, as it allocates the strict minimum: 31 * 8 bytes
- Unfortunately, this uncovered the 1.10 bug where the INT96 precision was used instead of the Drill timestamp precision of 8 bytes
- My guess is that Drill 1.10 has the same issue for larger datasets

**NOTE**
- Setting the writer offset to an erroneous value when encountering null values does not lead to data corruption
- My guess is that setting the data goes through a mutator that uses entry-based indexing (since the type is fixed-length); thus, the bad writer index offset is not causing harm

**FIX**
- The fix is to modify the Parquet fixed reader code to use the Drill data type precision instead (a hedged reader sketch follows below)
- Looking at the code, only the nullable fixed reader is affected by this bug
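To make the overflow concrete, here is a minimal, self-contained sketch of the arithmetic described above. The class and constant names are mine, not Drill's; only the 12-byte/8-byte widths and the 31-null scenario come from the description.

```java
// Illustrative only: shows why advancing the write offset by the Parquet
// INT96 width overruns a buffer sized for the Drill TIMESTAMP width.
public class TimestampOffsetArithmetic {

  static final int PARQUET_INT96_WIDTH = 12;   // on-disk precision of INT96
  static final int DRILL_TIMESTAMP_WIDTH = 8;  // in-memory precision of Drill TIMESTAMP

  public static void main(String[] args) {
    int nullCount = 31;

    // Batch sizing allocates the strict minimum for the Drill type.
    int allocatedBytes = nullCount * DRILL_TIMESTAMP_WIDTH;      // 248 bytes

    // Buggy behavior: offset advanced using the Parquet precision.
    int buggyWriteOffset = nullCount * PARQUET_INT96_WIDTH;      // 372 bytes -> past the buffer

    // Fixed behavior: offset advanced using the Drill precision.
    int fixedWriteOffset = nullCount * DRILL_TIMESTAMP_WIDTH;    // 248 bytes -> within bounds

    System.out.printf("allocated=%d, buggy offset=%d (overflow=%b), fixed offset=%d (overflow=%b)%n",
        allocatedBytes,
        buggyWriteOffset, buggyWriteOffset > allocatedBytes,
        fixedWriteOffset, fixedWriteOffset > allocatedBytes);
  }
}
```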
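And here is a hypothetical sketch of the fix in a nullable fixed-width reader loop. Drill's actual reader classes differ; every name below (readValues, int96ToMillis, definedAtIndex, etc.) is illustrative, assuming only the behavior stated in the description: the source page advances by the Parquet width while the destination buffer must advance by the Drill width.

```java
import java.nio.ByteBuffer;

// Sketch, not Drill code: a nullable fixed-width read loop where the
// destination offset must use the Drill type width, not the source width.
public class NullableFixedReaderSketch {

  static final int INT96_WIDTH = 12;    // Parquet on-disk precision
  static final int TIMESTAMP_WIDTH = 8; // Drill in-memory precision

  // Stub: the real conversion from INT96 (Julian day + nanos) to epoch
  // milliseconds is omitted here.
  static long int96ToMillis(byte[] page, int pos) {
    return 0L;
  }

  static void readValues(boolean[] definedAtIndex, byte[] pageData,
                         ByteBuffer drillBuffer, int recordCount) {
    int pageReadPos = 0;  // position in the Parquet page (12-byte entries)
    int writeOffset = 0;  // position in the Drill buffer (8-byte entries)

    for (int i = 0; i < recordCount; i++) {
      if (definedAtIndex[i]) {
        drillBuffer.putLong(writeOffset, int96ToMillis(pageData, pageReadPos));
        pageReadPos += INT96_WIDTH; // the source advances by the Parquet width
      }
      // BUG: advancing this by INT96_WIDTH (12) walks past a buffer sized
      // at recordCount * TIMESTAMP_WIDTH.
      // FIX: advance by the Drill type width instead.
      writeOffset += TIMESTAMP_WIDTH;
    }
  }
}
```

Note the loop advances writeOffset on every record, null or not, which matches the description's point that each fixed-length slot occupies a position even when the value is null.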
