You may have a look at the APIs in the parquet.filter2 package. We use this in Spark SQL to enable Parquet filter push-down. Basically, you need to convert your query predicate into a Parquet FilterPredicate and then set it on the Hadoop configuration object via ParquetInputFormat.setFilterPredicate. Here is how we do the predicate conversion in Spark SQL: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetFilters.scala
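For example, the Java side looks roughly like this (a rough sketch, assuming the pre-1.0 "parquet.*" package names; the column name "id" and the value 12345 are just placeholders for your own column and query):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    import parquet.filter2.predicate.FilterApi;
    import parquet.filter2.predicate.FilterPredicate;
    import parquet.hadoop.ParquetInputFormat;

    public class FilterSetupExample {
      public static void main(String[] args) throws Exception {
        // Keep only rows where the integer column "id" equals 12345
        // (both the column name and the value are placeholders here).
        FilterPredicate pred = FilterApi.eq(FilterApi.intColumn("id"), 12345);

        // Register the predicate with ParquetInputFormat. Row groups whose
        // min/max column statistics cannot satisfy the predicate are dropped
        // before their data is read; matching groups are still scanned and
        // filtered record by record.
        Job job = Job.getInstance(new Configuration());
        ParquetInputFormat.setFilterPredicate(job.getConfiguration(), pred);
      }
    }

Any job built on ParquetInputFormat (or a subclass such as AvroParquetInputFormat) should then pick the predicate up from the configuration.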

Cheng

On 2/5/15 10:22 PM, Mohit Jaggi wrote:

Hi Parquet Developers,
I have a use case where I may repeatedly (but from different processes) “query” a large Parquet file for specific rows. The query is a filter on one of the columns, and that column is just an increasing integer (e.g. 1, 2, 3, 4…). If I naively use predicate pushdown, the whole file will be scanned for every query, right? But there is enough metadata to allow me to skip “pages” and “row groups” that don’t have a match. Is there an API that I can use to skip over “row groups” and “pages” and scan only the pages that contain the row I am looking for? I saw references to “metadata based predicate pushdown” and “indexes in Parquet 2.0”, so I guess such APIs do exist.

Thanks for your help,
Mohit.
