[ 
https://issues.apache.org/jira/browse/DRILL-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jacques Nadeau updated DRILL-1950:
----------------------------------
    Assignee:     (was: Jacques Nadeau)

> Implement filter pushdown for Parquet
> -------------------------------------
>
>                 Key: DRILL-1950
>                 URL: https://issues.apache.org/jira/browse/DRILL-1950
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Storage - Parquet
>            Reporter: Jason Altekruse
>            Priority: Critical
>             Fix For: 1.7.0
>
>         Attachments: DRILL-1950.1.patch.txt
>
>
> The Parquet reader currently supports project pushdown, which limits the
> number of columns read, but it does not use filter pushdown to read only a
> subset of the rows in the requested columns. Filter pushdown is particularly
> useful with Parquet files that contain statistics, most importantly min and
> max values on pages. Evaluating predicates against these values could save
> significant reading and decoding time.
> The largest barrier to implementing this is the current design of the reader.
> First, we currently have two separate Parquet readers, one for reading flat
> files very quickly and another for reading complex data. There are
> enhancements we can make to the flat reader so that it supports nested data
> much more efficiently. However, the speed of the flat file reader
> currently comes from being able to make vectorized copies out of the Parquet
> file. This design is somewhat at odds with filter pushdown, as we can only
> make useful vectorized copies if the filter matches a large run of values
> within the file. This might not be too rare a case, assuming files are often
> somewhat sorted on a primary field like a date or a numeric key, and these
> are often the fields used to limit the query to a subset of the data.
> However, for cases where we are filtering out a few records here and there,
> we should just make individual copies.
> We need to do more design work on the best way to balance performance with 
> these use cases in mind.
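
For illustration only, the idea of evaluating a predicate against page-level
min/max statistics could look roughly like the sketch below. It is written in
Python for brevity and is not Drill or Parquet library code: PageStats,
page_may_match and pages_to_read are hypothetical names, and the statistics
are assumed to be plain integers.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PageStats:
    """Hypothetical per-page statistics: smallest and largest value on a page."""
    min_value: int
    max_value: int


def page_may_match(stats: PageStats, op: str, literal: int) -> bool:
    """Return False only when no value on the page can satisfy `column op literal`."""
    if op == "=":
        return stats.min_value <= literal <= stats.max_value
    if op == "<":
        return stats.min_value < literal
    if op == "<=":
        return stats.min_value <= literal
    if op == ">":
        return stats.max_value > literal
    if op == ">=":
        return stats.max_value >= literal
    # Unknown operator: stay conservative and read the page.
    return True


def pages_to_read(pages: List[PageStats], op: str, literal: int) -> List[int]:
    """Indexes of the pages that still have to be read and decoded."""
    return [i for i, stats in enumerate(pages) if page_may_match(stats, op, literal)]


# Three pages of a column that happens to be sorted, filtered with `col > 150`.
pages = [PageStats(1, 100), PageStats(101, 200), PageStats(201, 300)]
survivors = pages_to_read(pages, ">", 150)
```

In this example the first page can be skipped without decoding it, because
its max value is below the literal. The check can only rule pages out; a page
that survives still needs the filter applied to its individual values.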

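The trade-off between vectorized and individual copies described in the issue
can be sketched as follows, again as an illustration in Python rather than
Drill reader code. selected_runs, copy_selected and the min_bulk_run threshold
are hypothetical; a list slice stands in for a vectorized copy.

```python
from typing import List, Sequence, Tuple


def selected_runs(mask: Sequence[bool]) -> List[Tuple[int, int]]:
    """Collapse a per-value filter result into (start, length) runs of matches."""
    runs = []
    start = None
    for i, keep in enumerate(mask):
        if keep and start is None:
            start = i
        elif not keep and start is not None:
            runs.append((start, i - start))
            start = None
    if start is not None:
        runs.append((start, len(mask) - start))
    return runs


def copy_selected(values: Sequence[int], mask: Sequence[bool],
                  min_bulk_run: int = 4) -> Tuple[List[int], int, int]:
    """Copy matching values, in bulk for long runs and one by one otherwise.

    Returns the copied values, the number of bulk copies, and the number of
    individual copies. min_bulk_run is an assumed tuning threshold.
    """
    out: List[int] = []
    bulk_copies = 0
    single_copies = 0
    for start, length in selected_runs(mask):
        if length >= min_bulk_run:
            out.extend(values[start:start + length])  # stand-in for a vectorized copy
            bulk_copies += 1
        else:
            for i in range(start, start + length):
                out.append(values[i])
                single_copies += 1
    return out, bulk_copies, single_copies


values = list(range(10))
mask = [True] * 5 + [False, True, False, False, True]
copied, bulk_copies, single_copies = copy_selected(values, mask)
```

With data that is roughly sorted on the filtered field, most matches fall
into a few long runs and the bulk path dominates. With scattered matches the
loop degrades to individual copies, which is the case the issue says still
needs design work.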


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
