[
https://issues.apache.org/jira/browse/DRILL-7763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151012#comment-17151012
]
ASF GitHub Bot commented on DRILL-7763:
---------------------------------------
cgivre commented on pull request #2092:
URL: https://github.com/apache/drill/pull/2092#issuecomment-653570450
@vvysotskyi
Thanks for taking a look.
> @cgivre, how would this work in the case where multiple fragments are created,
each with its own scan? From the code, it looks like every fragment would read
the same number of rows specified in the limit. Also, will the limit operator be
preserved in the plan if the scan supports limit pushdown?
Firstly, the format plugin has to explicitly enable the pushdown. I don't
have the best test infrastructure, so maybe you could assist with that, but I
do believe that each fragment would read the same number of rows in its own
scan. Ideally, I'd like to fix that, but consider the case of 5 scans reading
files of 1000 rows each, with a limit of 100 on the query. Without this PR, my
observation was that Drill would still read all 5000 rows, whereas with this
PR, that drops to 500 (100 per fragment).
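The arithmetic above can be sketched as a small model. This is a hypothetical
illustration of the behavior being discussed, not Drill code: each fragment
independently caps its own scan at the limit, so the total rows read is
fragments × limit rather than just the limit.

```java
// Toy model of per-fragment limit behavior (NOT Drill's implementation).
public class FragmentLimitModel {

  // Without pushdown, every fragment reads its file in full.
  static long rowsReadWithoutPushdown(int fragments, long rowsPerFile) {
    return fragments * rowsPerFile;
  }

  // With pushdown, each fragment stops once it has read `limit` rows,
  // but every fragment still reads up to that many rows on its own.
  static long rowsReadWithPushdown(int fragments, long rowsPerFile, long limit) {
    return fragments * Math.min(rowsPerFile, limit);
  }

  public static void main(String[] args) {
    // The example from the discussion: 5 scans over 1000-row files, LIMIT 100.
    System.out.println(rowsReadWithoutPushdown(5, 1000));     // 5000
    System.out.println(rowsReadWithPushdown(5, 1000, 100));   // 500
  }
}
```

Fully eliminating the per-fragment overshoot would require coordinating the
limit across fragments, which is the part described above as still to be fixed.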
>
> The Metastore also provides capabilities for pushing the limit, but it works
slightly differently - it prunes files and leaves only the minimum number of
files with the specified row count. Would these two features coexist and work
correctly?
I didn't know about this feature in the Metastore. I would like these
features to coexist if possible. Could you point me to some resources or docs
so that I can take a look? Ideally, I'd like to make it such that we get the
minimum number of files from the Metastore AND apply the row limit as well, so
that we read the absolute minimum amount of data.
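The Metastore-style pruning described above can be sketched as a greedy pass
over known per-file row counts: keep files only until their counts cover the
limit. This is an assumption about the approach, not the Metastore's actual
implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of limit-driven file pruning (NOT the Metastore code):
// given per-file row counts, keep the minimum prefix of files whose counts
// add up to at least the limit; the rest never need to be opened.
public class FilePruneSketch {

  static List<String> pruneForLimit(List<String> files, List<Long> rowCounts, long limit) {
    List<String> kept = new ArrayList<>();
    long covered = 0;
    for (int i = 0; i < files.size() && covered < limit; i++) {
      kept.add(files.get(i));
      covered += rowCounts.get(i);
    }
    return kept;
  }
}
```

With three 1000-row files and LIMIT 100, only the first file survives pruning;
combining this with the row-level pushdown in this PR would then also stop that
single reader after 100 rows, which is the coexistence asked about above.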
For some background, I was working on a project where I had several GB of
PCAP files in multiple directories. I found that Drill could query these files
fairly rapidly, but it seemed to still have a lot of overhead in terms of how
many files it was actually reading. Separately, when I was working on the
Splunk plugin (https://github.com/apache/drill/pull/2089), I discovered that
virtually no storage plugins seemed to implement limit pushdown. This was
puzzling, since the rules and logic for it already exist in Drill and in the
GroupScan. On top of that, it's a fairly easy addition.
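The opt-in pattern described above can be sketched roughly as follows. The
method names mirror the pushdown hooks discussed in this thread, but the
interface and class here are a simplified stand-in, not Drill's real
GroupScan classes; treat the shapes as assumptions.

```java
// Simplified model of a scan opting in to limit pushdown (NOT Drill's API).
interface Scan {
  boolean supportsLimitPushdown();

  // Returns a copy of the scan capped at maxRecords, or null if the planner
  // should keep the original scan unchanged.
  Scan applyLimit(int maxRecords);
}

class EasyScanModel implements Scan {
  final boolean pushdownEnabled;  // set per format plugin, as in this PR
  final int maxRecords;           // -1 means unlimited

  EasyScanModel(boolean pushdownEnabled, int maxRecords) {
    this.pushdownEnabled = pushdownEnabled;
    this.maxRecords = maxRecords;
  }

  public boolean supportsLimitPushdown() {
    return pushdownEnabled;
  }

  public Scan applyLimit(int limit) {
    // No change if pushdown is off, or the scan is already at least as tight.
    if (!pushdownEnabled || (maxRecords >= 0 && maxRecords <= limit)) {
      return null;
    }
    return new EasyScanModel(pushdownEnabled, limit);
  }
}
```

The point of the opt-in flag is that formats where early cutoff is unsafe
(such as JSON, per the issue description below) simply never enable it.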
Getting back to this PR, I wanted to see if it made a performance difference
when querying some large files on my machine, and the difference was striking.
Simple queries and queries with a `WHERE` clause, which used to take seconds,
are now virtually instantaneous. The difference in user experience is
dramatic.
Anyway, I'd appreciate any help you can give with respect to the metastore
and incorporating that into the PR.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
> Add Limit Pushdown to File Based Storage Plugins
> ------------------------------------------------
>
> Key: DRILL-7763
> URL: https://issues.apache.org/jira/browse/DRILL-7763
> Project: Apache Drill
> Issue Type: Improvement
> Affects Versions: 1.17.0
> Reporter: Charles Givre
> Assignee: Charles Givre
> Priority: Major
> Fix For: 1.18.0
>
>
> As currently implemented, when querying a file, Drill will read the entire
> file even if a limit is specified in the query. This PR does a few things:
> # Refactors the EasyGroupScan, EasySubScan, and EasyFormatConfig to allow
> the option of pushing down limits.
> # Applies this to all the EVF based format plugins which are: LogRegex,
> PCAP, SPSS, Esri, Excel and Text (CSV).
> Due to JSON's fluid schema, it would be unwise to adopt the limit pushdown as
> it could result in very inconsistent schemata.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)