sachouche opened a new pull request #1630: DRILL-7018: Fixed Parquet buffer overflow when reading timestamp column
URL: https://github.com/apache/drill/pull/1630

**Problem Description**
- Parquet advertises timestamps with a 12-byte precision (INT96)
- By default, such entries are mapped to a Drill VARBINARY data type
- When the "store.parquet.reader.int96_as_timestamp" option is set, the Parquet reader maps this data type to a Drill TIMESTAMP, which has less precision (8 bytes; the nanosecond part is lost)
- The previous Drill code (before the batch sizing feature) used to pre-allocate buffers with room for at least 4096 values
- The Parquet file in question has a timestamp column with 31 null values
- The code attempts to advance the write offset by 31 * 12 bytes (the original column precision); the correct offset should be based on the Drill precision, as is already done for non-null values (see the arithmetic sketch after this description)
- The old Drill version worked only because it had plenty of pre-allocated space
- The batch sizing feature is smarter, as it allocates the strict minimum: 31 * 8 bytes
- Unfortunately, this uncovered the 1.10 bug where the INT96 precision was used instead of the Drill timestamp precision of 8 bytes
- My guess is that Drill 1.10 has the same issue for larger datasets

**NOTE**
- Setting the writer offset to an erroneous value when encountering null values does not lead to data corruption
- My guess is that setting the data goes through a mutator that uses entry-based indexing (since the type is fixed-length); thus, the bad writer index offset is not causing harm

**FIX**
- The fix is to modify the Parquet fixed reader code to use the Drill data type precision instead (a hedged reader sketch follows below)
- Looking at the code, only the nullable fixed reader is affected by this bug
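To make the overflow concrete, here is a minimal, self-contained sketch of the arithmetic described above. The class and constant names are mine, not Drill's; only the 12-byte/8-byte widths and the 31-null scenario come from the description.

```java
// Illustrative only: shows why advancing the write offset by the Parquet
// INT96 width overruns a buffer sized for the Drill TIMESTAMP width.
public class TimestampOffsetArithmetic {

  static final int PARQUET_INT96_WIDTH = 12;   // on-disk precision of INT96
  static final int DRILL_TIMESTAMP_WIDTH = 8;  // in-memory precision of Drill TIMESTAMP

  public static void main(String[] args) {
    int nullCount = 31;

    // Batch sizing allocates the strict minimum for the Drill type.
    int allocatedBytes = nullCount * DRILL_TIMESTAMP_WIDTH;      // 248 bytes

    // Buggy behavior: offset advanced using the Parquet precision.
    int buggyWriteOffset = nullCount * PARQUET_INT96_WIDTH;      // 372 bytes -> past the buffer

    // Fixed behavior: offset advanced using the Drill precision.
    int fixedWriteOffset = nullCount * DRILL_TIMESTAMP_WIDTH;    // 248 bytes -> within bounds

    System.out.printf("allocated=%d, buggy offset=%d (overflow=%b), fixed offset=%d (overflow=%b)%n",
        allocatedBytes,
        buggyWriteOffset, buggyWriteOffset > allocatedBytes,
        fixedWriteOffset, fixedWriteOffset > allocatedBytes);
  }
}
```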
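And here is a hypothetical sketch of the fix in a nullable fixed-width reader loop. Drill's actual reader classes differ; every name below (readValues, int96ToMillis, definedAtIndex, etc.) is illustrative, assuming only the behavior stated in the description: the source page advances by the Parquet width while the destination buffer must advance by the Drill width.

```java
import java.nio.ByteBuffer;

// Sketch, not Drill code: a nullable fixed-width read loop where the
// destination offset must use the Drill type width, not the source width.
public class NullableFixedReaderSketch {

  static final int INT96_WIDTH = 12;    // Parquet on-disk precision
  static final int TIMESTAMP_WIDTH = 8; // Drill in-memory precision

  // Stub: the real conversion from INT96 (Julian day + nanos) to epoch
  // milliseconds is omitted here.
  static long int96ToMillis(byte[] page, int pos) {
    return 0L;
  }

  static void readValues(boolean[] definedAtIndex, byte[] pageData,
                         ByteBuffer drillBuffer, int recordCount) {
    int pageReadPos = 0;  // position in the Parquet page (12-byte entries)
    int writeOffset = 0;  // position in the Drill buffer (8-byte entries)

    for (int i = 0; i < recordCount; i++) {
      if (definedAtIndex[i]) {
        drillBuffer.putLong(writeOffset, int96ToMillis(pageData, pageReadPos));
        pageReadPos += INT96_WIDTH; // the source advances by the Parquet width
      }
      // BUG: advancing this by INT96_WIDTH (12) walks past a buffer sized
      // at recordCount * TIMESTAMP_WIDTH.
      // FIX: advance by the Drill type width instead.
      writeOffset += TIMESTAMP_WIDTH;
    }
  }
}
```

Note the loop advances writeOffset on every record, null or not, which matches the description's point that each fixed-length slot occupies a position even when the value is null.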
