[jira] [Created] (DRILL-8084) Scan LIMIT pushdown fails across files

Paul Rogers (Jira) Sun, 19 Dec 2021 15:05:04 -0800

Paul Rogers created DRILL-8084:
----------------------------------

             Summary: Scan LIMIT pushdown fails across files
                 Key: DRILL-8084
                 URL: https://issues.apache.org/jira/browse/DRILL-8084
             Project: Apache Drill
          Issue Type: Bug
    Affects Versions: 1.19.0
            Reporter: Paul Rogers
            Assignee: Paul Rogers



DRILL-7763 apparently added limit pushdowns to the file format plugins, which 
is a nice improvement. Unfortunately, the implementation only works for a scan 
with a single file: the limit is applied to each file independently. The 
correct implementation is to apply the limit to the _scan_, not the 
_file_.

Further, {{LIMIT 0}} has meaning: it asks to return a schema with no data. 
However, the implementation uses {{maxRecords == 0}} to mean "no limit", and a 
bit of code explicitly changes {{LIMIT 0}} to {{LIMIT 1}} so that "we read at 
least one file".
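
One way to keep {{LIMIT 0}} distinct from "no limit" is to avoid overloading 0 as a sentinel. The following is a minimal sketch only; the class {{PushedLimit}} and its methods are hypothetical names invented for illustration and are not part of Drill.

```java
import java.util.OptionalLong;

// Hypothetical illustration: not Drill's API. An absent value means
// "no LIMIT was pushed down"; a present 0 means "schema only, no rows".
final class PushedLimit {
  private final OptionalLong maxRecords;

  private PushedLimit(OptionalLong maxRecords) {
    this.maxRecords = maxRecords;
  }

  /** No LIMIT pushed into the scan: read everything. */
  static PushedLimit none() {
    return new PushedLimit(OptionalLong.empty());
  }

  /** LIMIT n pushed into the scan; n == 0 is legal and meaningful. */
  static PushedLimit of(long n) {
    if (n < 0) {
      throw new IllegalArgumentException("limit must be >= 0");
    }
    return new PushedLimit(OptionalLong.of(n));
  }

  /** True only for LIMIT 0: return a schema, but no data. */
  boolean isSchemaOnly() {
    return maxRecords.isPresent() && maxRecords.getAsLong() == 0;
  }

  /** True if another row may be returned after rowsSoFar rows. */
  boolean allowsMoreRows(long rowsSoFar) {
    return !maxRecords.isPresent() || rowsSoFar < maxRecords.getAsLong();
  }
}
```

With this representation, {{PushedLimit.of(0)}} never allows a row, while {{PushedLimit.none()}} always does, so nothing needs to rewrite {{LIMIT 0}} to {{LIMIT 1}}.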

Consider an example. Two files, A and B, each of which has 10 records:
 * {{LIMIT 0}}: Obtain the schema from A, read no data from A. Do not open 
B. The current code changes {{LIMIT 0}} to {{LIMIT 1}}, thus returning data.
 * {{LIMIT 1}}: Read one record from A, none from B. (Don't even open B.) 
The current code will read 1 record from A and another from B.
 * {{LIMIT 15}}: Read all 10 records from A, and only 5 from B. The current 
code applies the limit of 15 to both files, thus reading 20 records.

The correct solution is to manage the {{LIMIT}} at the scan level. As each file 
completes, subtract the returned row count from the limit applied to the next 
file.
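
The scan-level bookkeeping described above can be sketched as follows. This is an illustration under assumptions, not the actual EVF code: the class and method names are hypothetical, and files are modeled only by their row counts.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration: not Drill's EVF API.
final class ScanLimitSketch {

  /**
   * Returns the rows read from each file that the scan opens, in order.
   * The first file is always opened (to obtain a schema, even for LIMIT 0);
   * later files are opened only while some of the limit remains.
   */
  static List<Long> rowsReadPerOpenedFile(long[] fileRowCounts, long limit) {
    if (limit < 0) {
      throw new IllegalArgumentException("limit must be >= 0");
    }
    List<Long> rowsRead = new ArrayList<>();
    long remaining = limit;
    for (int i = 0; i < fileRowCounts.length; i++) {
      if (i > 0 && remaining == 0) {
        break; // limit already satisfied: do not open this file
      }
      long take = Math.min(fileRowCounts[i], remaining);
      rowsRead.add(take);
      remaining -= take; // the next file sees only what is left
    }
    return rowsRead;
  }

  public static void main(String[] args) {
    long[] files = {10, 10}; // files A and B, 10 records each
    for (long limit : new long[] {0, 1, 15}) {
      System.out.println("LIMIT " + limit + " -> "
          + rowsReadPerOpenedFile(files, limit));
    }
  }
}
```

For the two 10-record files A and B, this yields {{[0]}} for {{LIMIT 0}}, {{[1]}} for {{LIMIT 1}}, and {{[10, 5]}} for {{LIMIT 15}}, matching the expected behavior listed in the example.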

And, at the file level, there is no need to have each file count its records 
and check the limit on each row read. The "result set loader" already checks 
batch limits: it is the place to check the overall limit.
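
To illustrate why the loader is a natural place for this check, here is a toy sketch in which a single "is full" test covers both the batch size and the overall limit, so the reader loop contains no limit logic of its own. The class {{LimitAwareLoader}} and all of its methods are hypothetical stand-ins, not the real "result set loader" API.

```java
import java.util.Iterator;

// Hypothetical illustration: a toy stand-in, not Drill's result set loader.
final class LimitAwareLoader {
  private final int batchSize;
  private final long limit;
  private long totalRows;
  private int batchRows;

  LimitAwareLoader(int batchSize, long limit) {
    this.batchSize = batchSize;
    this.limit = limit;
  }

  void startBatch() {
    batchRows = 0;
  }

  /** One check for both conditions: batch is full, or limit is reached. */
  boolean isFull() {
    return batchRows >= batchSize || limitReached();
  }

  boolean limitReached() {
    return totalRows >= limit;
  }

  void writeRow() {
    if (isFull()) {
      throw new IllegalStateException("batch full or limit reached");
    }
    batchRows++;
    totalRows++;
  }

  int batchRowCount() {
    return batchRows;
  }

  /** A reader loop: it only asks isFull() and never counts rows itself. */
  static int readBatch(LimitAwareLoader loader, Iterator<?> rows) {
    loader.startBatch();
    while (!loader.isFull() && rows.hasNext()) {
      rows.next();
      loader.writeRow();
    }
    return loader.batchRowCount();
  }
}
```

In this sketch, a loader built with a batch size of 4 and a limit of 6 produces batches of 4, 2, and then 0 rows from a 10-row source, without the reader ever seeing the limit.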

For this reason, the V2 EVF scan framework has been extended to manage the 
scan-level part, and the "result set loader" has been extended to enforce the 
per-file limit. The result is that readers need do...absolutely nothing; 
{{LIMIT}} pushdown is automatic.

EVF V1 has also been extended, but is less thoroughly tested since the desired 
path is to upgrade all readers to use EVF V2.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
