You may want to take a look at the APIs in the |parquet.filter2| package. We use
this in Spark SQL to enable Parquet filter push-down. Basically, you need
to convert your query predicate into a Parquet |FilterPredicate|, then set
it on the Hadoop configuration object via
|ParquetInputFormat.setFilterPredicate|. Here is how we do the predicate
conversion in Spark SQL:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetFilters.scala
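For reference, here is a minimal sketch of that workflow using the Java filter2 API. The column name "id" and the literal 42 are just placeholders, and the exact package names may differ depending on which Parquet version you are on:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

import parquet.filter2.predicate.FilterApi;
import parquet.filter2.predicate.FilterPredicate;
import parquet.filter2.predicate.Operators.IntColumn;
import parquet.hadoop.ParquetInputFormat;

public class ParquetFilterExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Build a predicate equivalent to "id = 42" on an INT32 column named "id".
    IntColumn id = FilterApi.intColumn("id");
    FilterPredicate predicate = FilterApi.eq(id, 42);

    // Register the predicate so ParquetInputFormat can use row group
    // statistics to skip non-matching row groups and drop non-matching
    // records while reading.
    ParquetInputFormat.setFilterPredicate(conf, predicate);

    Job job = Job.getInstance(conf, "parquet-filter-example");
    // ... set input paths and the rest of the job configuration as usual ...
  }
}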
Cheng
On 2/5/15 10:22 PM, Mohit Jaggi wrote:
Hi Parquet Developers,
I have a use case where I may repeatedly (but from different processes) “query” a
large Parquet file for specific rows. The query is a filter on one of the columns,
and that column is just an increasing integer (e.g. 1, 2, 3, 4…). If I naively use
predicate pushdown, the whole file will be scanned for every query, right? But there
is enough metadata to allow me to skip “pages” and “row groups” that don’t have
a match. Is there an API that I can use to skip over “row groups” and “pages” and
scan only the pages that have the row I am looking for? I saw references to
“metadata based predicate pushdown” and “indexes in parquet 2.0”, so I guess such
APIs do exist.
Thanks for your help,
Mohit.