[ https://issues.apache.org/jira/browse/DRILL-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Adam Gilmore updated DRILL-1950:
--------------------------------
    Attachment:     (was: DRILL-1950.1.patch.txt)

> Implement filter pushdown for Parquet
> -------------------------------------
>
>                 Key: DRILL-1950
>                 URL: https://issues.apache.org/jira/browse/DRILL-1950
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Storage - Parquet
>            Reporter: Jason Altekruse
>            Assignee: Jacques Nadeau
>             Fix For: Future
>
>         Attachments: DRILL-1950.1.patch.txt
>
>
> The parquet reader currently supports project pushdown, for limiting the
> number of columns read; however, it does not use filter pushdown to read a
> subset of the requested columns. This is particularly useful with parquet
> files that contain statistics, most importantly min and max values on pages.
> Evaluating predicates against these values could save some major reading and
> decoding time.
> The largest barrier to implementing this is the current design of the reader.
> Firstly, we currently have two separate parquet readers, one for reading flat
> files very quickly and another for reading complex data. There are
> enhancements we can make to the flat reader, to make it support nested data
> in a much more efficient manner. However, the speed of the flat file reader
> currently comes from being able to make vectorized copies out of the parquet
> file. This design is somewhat at odds with filter pushdown, as we can only
> make useful vectorized copies if the filter matches a large run of values
> within the file. This might not be too rare a case, assuming files are often
> somewhat sorted on a primary field like date or a numeric key, and these are
> often fields used to limit the query to a subset of the data. However, for
> cases where we are filtering out a few records here and there, we should just
> make individual copies.
> We need to do more design work on the best way to balance performance with
> these use cases in mind.
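
To make the trade-off described in the issue concrete, here is a minimal sketch in Python. It is not Drill's reader and does not use any Parquet library: `PageStats`, `page_may_match`, `matching_runs`, `filtered_read` and the `bulk_threshold` parameter are all hypothetical names invented for illustration. It assumes a simple range predicate (`low <= value <= high`) on an integer column, skips pages whose min/max statistics rule out any match, and, within the remaining pages, uses a slice copy (standing in for a vectorized copy) for long runs of matches and individual copies otherwise.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class PageStats:
    """Hypothetical per-page statistics; not Drill's or Parquet's actual classes."""
    min_value: int
    max_value: int


def page_may_match(stats: PageStats, low: int, high: int) -> bool:
    """Return False only when no value in the page can satisfy low <= v <= high."""
    return stats.max_value >= low and stats.min_value <= high


def matching_runs(values: Sequence[int], low: int, high: int) -> List[Tuple[int, int]]:
    """Return half-open (start, end) index ranges of consecutive matching values."""
    runs = []
    start = None
    for i, v in enumerate(values):
        if low <= v <= high:
            if start is None:
                start = i  # a new run of matches begins here
        elif start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(values)))
    return runs


def filtered_read(pages, low, high, bulk_threshold=4):
    """Read matching values from a list of (PageStats, values) pairs.

    Returns (output, pages_decoded, bulk_copies, single_copies).
    bulk_threshold is an assumed tuning knob, not a value from the issue.
    """
    out = []
    decoded = bulk = single = 0
    for stats, values in pages:
        if not page_may_match(stats, low, high):
            continue  # statistics rule the page out, so it is never decoded
        decoded += 1
        for start, end in matching_runs(values, low, high):
            if end - start >= bulk_threshold:
                out.extend(values[start:end])  # stands in for a vectorized copy
                bulk += 1
            else:
                for i in range(start, end):  # short run: copy value by value
                    out.append(values[i])
                    single += 1
    return out, decoded, bulk, single


pages = [
    (PageStats(1, 5), [1, 2, 3, 4, 5]),          # skipped: max < 7
    (PageStats(6, 12), [6, 7, 8, 9, 10, 12]),    # one long run of matches
    (PageStats(3, 20), [20, 3, 9, 15, 8]),       # scattered matches
]
print(filtered_read(pages, 7, 10))
# prints ([7, 8, 9, 10, 9, 8], 2, 1, 2)
```

In this toy input the first page is never decoded, the second page (roughly sorted data) yields one bulk copy, and the third page (unsorted data) falls back to two individual copies, which mirrors the two cases the issue describes. Where the threshold between the two copy strategies should sit is exactly the open design question raised above.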