GitHub user ppadma opened a pull request:
https://github.com/apache/drill/pull/597
DRILL-4905: Push down the LIMIT to the parquet reader scan.
For a LIMIT N query, where N is less than the current default record batch
size (256K rows when all columns are fixed-length, 32K rows otherwise), we
still end up reading all 256K/32K rows from disk if the row group has that
many rows. This causes performance degradation, especially when there is a
large number of columns.
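For example, a query of the following shape hits this path; the table
path here is hypothetical:

    SELECT * FROM dfs.`/data/wide_table.parquet` LIMIT 10;

Only 10 rows are needed, but the reader still fills a full default-sized
batch from disk.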
This fix addresses the problem by changing the record batch size the
parquet record reader uses, so that we don't read more rows than needed.
It also adds a system option (store.parquet.record_batch_size) for setting
the record batch size explicitly.
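As a sketch of how the option could be used (the option name is the one
added by this patch; the value shown is illustrative, assuming it is a
row count):

    ALTER SESSION SET `store.parquet.record_batch_size` = 4096;

The LIMIT pushdown itself caps the batch size automatically, as described
above; the option is for setting it by hand.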
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/ppadma/drill DRILL-4905
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/drill/pull/597.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #597
----
commit cd665ebdba11f8685ba446f5ec535c81ddd6edc7
Author: Padma Penumarthy <[email protected]>
Date: 2016-09-26T17:51:07Z
DRILL-4905: Push down the LIMIT to the parquet reader scan to limit the
number of records read
----