GitHub user parthchandra opened a pull request:
https://github.com/apache/drill/pull/611
DRILL-4800: Improve Parquet reader performance
Added a buffering input stream
Updated Parquet reader to optionally use the buffering input stream
Added optional asynchronous reading of page data
Added optional parallel decompression and decoding of columns
Decompression of data using Gzip/Snappy bypasses the Parquet APIs and
calls the decompressors directly (there were concurrency issues with using the
Parquet APIs)
Added new operator metrics for asynchronous page reading.
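As a rough illustration of what "calling the decompressors directly" can look like for Gzip, here is a minimal sketch that uses only the JDK's java.util.zip classes. It is not the code from this pull request: the class and method names are hypothetical, and it assumes the caller already knows the uncompressed page size (Parquet page headers record it). Each call builds its own stream, so no decompressor state is shared between threads.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper for illustration; not a class from the Drill code base.
final class DirectGzipSketch {

    // Compress with the JDK gzip stream. Only used here to produce sample input.
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Decompress one page into a buffer of the known uncompressed size.
    // A fresh GZIPInputStream per call means nothing is shared across threads.
    static byte[] gunzip(byte[] compressed, int uncompressedSize) {
        byte[] result = new byte[uncompressedSize];
        try (GZIPInputStream in =
                 new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            int off = 0;
            while (off < uncompressedSize) {
                int n = in.read(result, off, uncompressedSize - off);
                if (n < 0) {
                    throw new IOException("page ended after " + off + " bytes");
                }
                off += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }
}
```

Snappy is not part of the JDK, so a comparable Snappy path would need a third-party codec library and is left out of this sketch.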
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/parthchandra/drill DRILL-4800
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/drill/pull/611.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #611
----
commit 0457d69cae403bc8abcebb90ead55769ec58f5ac
Author: Parth Chandra <[email protected]>
Date: 2016-06-10T21:56:41Z
DRILL-4800: Use a buffering input stream in the Parquet reader
commit a33200107a5180f1b0dbad2b2e5b0905de4ed884
Author: Parth Chandra <[email protected]>
Date: 2016-08-24T17:46:37Z
DRILL-4800: Parallelize column reading.
Read/Decode fixed width fields in parallel
Decoding var length columns in parallel
Use simplified decompress method for Gzip and Snappy decompression.
Avoids concurrency issue with Parquet decompression. (It's also faster).
Stress test Parquet read write
Parallel column reader is disabled by default (may perform less well
under higher concurrency)
commit 8d9c26071b4826bda917ac4e88c70b7351a16d83
Author: Parth Chandra <[email protected]>
Date: 2016-09-27T21:03:35Z
DRILL-4800: Add AsyncPageReader to pipeline PageRead
Use non tracking input stream for Parquet scans.
Make choice between async and sync reader configurable.
Make various options user configurable - choose between sync and async
page reader, enable/disable fadvise
Add Parquet Scan metrics to track time spent in various operations
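The idea of pipelining page reads can be sketched as follows: the read of the next page is issued on a background thread before the current page is decoded, so I/O and decoding overlap. This is an assumption-laden illustration, not the AsyncPageReader from this commit; the class name, the Iterator standing in for the page source, and the decode function are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical sketch; none of these names come from the Drill source.
final class PipelinedPageReadSketch {

    // Fetches pages from 'source' on a background thread while the caller
    // decodes the previously fetched page. At most one read is in flight.
    static <P, R> List<R> readAll(Iterator<P> source, Function<P, R> decode) {
        ExecutorService io = Executors.newSingleThreadExecutor();
        Callable<P> readNext = source::next;
        List<R> out = new ArrayList<>();
        try {
            Future<P> pending = source.hasNext() ? io.submit(readNext) : null;
            while (pending != null) {
                P page = pending.get();  // wait for the read issued earlier
                // Issue the next read before decoding so the two overlap.
                pending = source.hasNext() ? io.submit(readNext) : null;
                out.add(decode.apply(page));
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            io.shutdownNow();
        }
        return out;
    }
}
```

The sketch keeps a single outstanding read per column; a synchronous reader would correspond to calling source.next() inline instead of submitting it.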
commit 91658f0cb3bb2ee3ff35a0ffde859052df91527e
Author: Parth Chandra <[email protected]>
Date: 2016-09-14T04:47:49Z
DRILL-4800: Various fixes.
Fix buffer underflow exception in BufferedDirectBufInputStream.
Fix writer index for int64 dictionary encoded types.
Added logging to help debug.
Fix memory leaks.
Work around issues with InputStream.available() (do not use
hasRemainder; remove check for EOF in BufferedDirectBufInputStream.read()).
Finalize defaults.
Remove commented code.
----
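On the InputStream.available() workaround in the last commit: available() only returns an estimate of the bytes readable without blocking, and it may legitimately return 0 before the end of the stream. The sketch below shows the general pattern of relying on the return value of read() instead. It is an illustration under that assumption, not the BufferedDirectBufInputStream code; both class names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical sketch; not taken from the Drill source.
final class ReadFullySketch {

    // A stream whose available() always reports 0, which the InputStream
    // contract allows. Code that treats 0 as EOF would stop too early here.
    static final class ZeroAvailableStream extends ByteArrayInputStream {
        ZeroAvailableStream(byte[] data) {
            super(data);
        }

        @Override
        public synchronized int available() {
            return 0;
        }
    }

    // Fills 'dst' as far as the stream allows and returns the byte count.
    // End of stream is detected only by read() returning -1.
    static int readFully(InputStream in, byte[] dst) {
        int off = 0;
        try {
            while (off < dst.length) {
                int n = in.read(dst, off, dst.length - off);
                if (n < 0) {
                    break;  // real end of stream
                }
                off += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return off;
    }
}
```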