GitHub user arina-ielchiieva opened a pull request:
https://github.com/apache/drill/pull/1030
DRILL-5941: Skip header / footer improvements for Hive storage plugin
Overview:
1. When a table has a header / footer, process input splits of the same
file in one reader (bug fix for DRILL-5941).
2. Apply skip header logic during reader initialization only once to avoid
checks during reading the data (DRILL-5106).
3. Apply skip footer logic only when the footer count is greater than 0;
otherwise default processing is done without buffering data in a queue
(DRILL-5106).
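A minimal sketch of items 2 and 3, for illustration only. The class and
method names (SkipHeaderFooterSketch, read) are hypothetical and are not
taken from the patch; records are modeled as plain strings rather than
Hive writables.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Queue;

public class SkipHeaderFooterSketch {

    /** Hypothetical helper: returns the records of one file without header and footer lines. */
    static List<String> read(Iterator<String> records, int headerCount, int footerCount) {
        // Skip the header once, up front, so the main loop needs no per-record header check.
        for (int i = 0; i < headerCount && records.hasNext(); i++) {
            records.next();
        }

        List<String> output = new ArrayList<>();
        if (footerCount <= 0) {
            // Default path: no footer, so records are emitted one by one without buffering.
            while (records.hasNext()) {
                output.add(records.next());
            }
            return output;
        }

        // Skip-footer path: hold back the most recent footerCount records in a queue.
        Queue<String> buffer = new ArrayDeque<>(footerCount + 1);
        while (records.hasNext()) {
            buffer.add(records.next());
            if (buffer.size() > footerCount) {
                // The oldest buffered record can no longer be part of the footer.
                output.add(buffer.poll());
            }
        }
        // Whatever is still buffered at end of file is the footer and is dropped.
        return output;
    }

    public static void main(String[] args) {
        List<String> file = Arrays.asList("header", "row1", "row2", "row3", "footer");
        System.out.println(read(file.iterator(), 1, 1)); // header and footer skipped
        System.out.println(read(file.iterator(), 1, 0)); // header skipped, no buffering
    }
}
```

The queue is only allocated on the skip-footer path, which is the point of
item 3: tables without a footer pay no buffering cost.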
Code changes:
1. AbstractReadersInitializer was introduced to factor out common logic
during readers initialization.
It will have three implementations:
a. Default (each input split gets its own reader);
b. Empty (for empty tables);
c. InputSplitGroups (applied when table has header / footer and input
splits of the same file should be processed together).
2. AbstractRecordsInspector was introduced to improve performance when
the table's footer count is less than or equal to 0.
It will have two implementations:
a. Default (records will be processed one by one without buffering);
b. SkipFooter (queue will be used to buffer N records that should be
skipped at the end of file processing).
3. Allow HiveAbstractReader to have multiple input splits.
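A rough sketch of how the three initializer cases in item 1 could divide
input splits among readers. This is an illustration under assumptions, not
the code from the patch: the Split class is a hypothetical stand-in for a
file-based input split, and group is a hypothetical helper in which each
returned inner list represents the splits handed to one reader.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class InputSplitGroupsSketch {

    /** Hypothetical stand-in for a file-based input split: a file path plus a byte range. */
    static final class Split {
        final String path;
        final long start;
        final long length;

        Split(String path, long start, long length) {
            this.path = path;
            this.start = start;
            this.length = length;
        }
    }

    /** Returns one list of splits per reader. */
    static List<List<Split>> group(List<Split> splits, boolean hasHeaderOrFooter) {
        List<List<Split>> groups = new ArrayList<>();
        if (splits.isEmpty()) {
            // Empty case: an empty table has no splits to distribute.
            return groups;
        }
        if (!hasHeaderOrFooter) {
            // Default case: each input split gets its own reader.
            for (Split split : splits) {
                groups.add(Collections.singletonList(split));
            }
            return groups;
        }
        // InputSplitGroups case: splits of the same file go to the same reader,
        // so header / footer lines are skipped once per file, not once per split.
        Map<String, List<Split>> byPath = new LinkedHashMap<>();
        for (Split split : splits) {
            byPath.computeIfAbsent(split.path, k -> new ArrayList<>()).add(split);
        }
        groups.addAll(byPath.values());
        return groups;
    }

    public static void main(String[] args) {
        List<Split> splits = Arrays.asList(
            new Split("/t/a.csv", 0, 128),
            new Split("/t/a.csv", 128, 64),
            new Split("/t/b.csv", 0, 100));
        System.out.println(group(splits, true).size());  // number of readers with grouping
        System.out.println(group(splits, false).size()); // number of readers without grouping
    }
}
```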
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/arina-ielchiieva/drill DRILL-5941
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/drill/pull/1030.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1030