Adam Gilmore updated DRILL-1950:
--------------------------------

    Attachment: DRILL-1950.1.patch.txt

Attached is a patch for filter pushdown for Parquet. I've added a new option to enable it (it is disabled by default): store.parquet.enable_pushdown_filter

Basically, the rule will attempt to push the filter into the scan, and the scan will use the Parquet footer statistics to try to eliminate row groups (sketched at the end of this comment). If row groups will be eliminated, the row count is reduced, so the optimizer will pick that plan. If no row groups would be eliminated, it's probably not better to push down the filter anyway. The only caveat here is that I've had to add an EmptyRowGroupScan for cases where *all* row groups have been eliminated. Not sure if anyone can suggest a better way.

For the flat reader, this purely does row group elimination while keeping the existing filter in the plan. I think this is wise, as our filters are probably far more efficient than implementing row-by-row filtering at the scan. For the complex/new reader, this pushes the filter right down to the scan (which is trivial, as the Parquet library provides it). If all expressions are pushed down, then we eliminate the filter entirely. We may decide it is better to do only row group elimination when the entire expression cannot be pushed down (as we're effectively duplicating a lot of the filtering effort). I haven't explicitly performance tested that part of it.

There is also a bug in the Parquet project which, when resolved, will make the row group elimination code nicer: the RowGroupFilter provided by Parquet will then check schema compatibility first, rather than us simply catching exceptions as we do now. https://issues.apache.org/jira/browse/PARQUET-247

Finally, we also infer the filter types from the query, e.g.:
{code}
select * from table where amount > 5
{code}
will infer the "5" as an integer, whereas:
{code}
select * from table where amount > 5.0
{code}
will infer the "5.0" as a double (see the second sketch below). The filter types must match the schema of the Parquet files; otherwise we will not be able to filter row groups or records. An improvement would be to match the filter against the schema of each Parquet file we're querying. That would mean, though, that we'd really have to push down a LogicalExpression to the reader instead (after first pruning it to only valid expressions), or push down multiple FilterPredicates (one for each Parquet file).

Ultimately, though, this patch should give us great value and will hopefully set a baseline on which to add improvements in performance and flexibility.
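To make the elimination step concrete, here is a rough sketch. It is not the exact code in the patch; the wrapper class and method names are hypothetical, but RowGroupFilter, FilterCompat and filterRowGroups are the actual Parquet library APIs. Until PARQUET-247 is resolved, a schema mismatch surfaces as an exception, so we catch it and keep all row groups rather than failing the query:

{code}
import java.util.List;

import parquet.filter2.compat.FilterCompat;
import parquet.filter2.compat.RowGroupFilter;
import parquet.filter2.predicate.FilterPredicate;
import parquet.hadoop.metadata.BlockMetaData;
import parquet.schema.MessageType;

public class RowGroupPruner {

  /**
   * Returns the row groups that might contain matching records, based on
   * the min/max statistics in the file footer. Falls back to keeping all
   * row groups when the predicate types don't line up with the file
   * schema (see PARQUET-247).
   */
  public static List<BlockMetaData> prune(FilterPredicate predicate,
                                          List<BlockMetaData> rowGroups,
                                          MessageType fileSchema) {
    try {
      return RowGroupFilter.filterRowGroups(
          FilterCompat.get(predicate), rowGroups, fileSchema);
    } catch (IllegalArgumentException e) {
      // The predicate's column types don't match this file's schema;
      // keep every row group and let the in-plan filter do the work.
      return rowGroups;
    }
  }
}
{code}

If the returned list is empty, the plan substitutes the EmptyRowGroupScan mentioned above; if it is merely smaller, the reduced row count is what steers the optimizer toward the pushed-down plan.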
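And to illustrate the type inference, here is roughly what the two queries above translate to as Parquet FilterPredicates (the column name "amount" is just the one from the example; the class and field names are only for illustration). The first predicate can only be evaluated against files whose "amount" column is a 32-bit integer, the second only where it is a double; this is exactly why a mismatch against the file schema currently disables filtering:

{code}
import static parquet.filter2.predicate.FilterApi.doubleColumn;
import static parquet.filter2.predicate.FilterApi.gt;
import static parquet.filter2.predicate.FilterApi.intColumn;

import parquet.filter2.predicate.FilterPredicate;

public class InferredPredicates {

  // WHERE amount > 5   -> "5" is inferred as an integer
  static final FilterPredicate AS_INT = gt(intColumn("amount"), 5);

  // WHERE amount > 5.0 -> "5.0" is inferred as a double
  static final FilterPredicate AS_DOUBLE = gt(doubleColumn("amount"), 5.0);
}
{code}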
> Implement filter pushdown for Parquet
> -------------------------------------
>
>                 Key: DRILL-1950
>                 URL: https://issues.apache.org/jira/browse/DRILL-1950
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Storage - Parquet
>            Reporter: Jason Altekruse
>            Assignee: Jacques Nadeau
>             Fix For: Future
>
>         Attachments: DRILL-1950.1.patch.txt
>
> The parquet reader currently supports project pushdown for limiting the number of columns read; however, it does not use filter pushdown to read only a subset of the requested records. This is particularly useful with parquet files that contain statistics, most importantly min and max values on pages. Evaluating predicates against these values could save some major reading and decoding time.
> The largest barrier to implementing this is the current design of the reader. Firstly, we currently have two separate parquet readers: one for reading flat files very quickly and another for reading complex data. There are enhancements we can make to the flat reader to make it support nested data in a much more efficient manner. However, the speed of the flat file reader currently comes from being able to make vectorized copies out of the parquet file. This design is somewhat at odds with filter pushdown, as we can only make useful vectorized copies if the filter matches a large run of values within the file. This might not be too rare a case, assuming files are often somewhat sorted on a primary field like date or a numeric key, and these are often the fields used to limit a query to a subset of the data. However, for cases where we are filtering out a few records here and there, we should just make individual copies.
> We need to do more design work on the best way to balance performance with these use cases in mind.