You may want to take a look at the APIs in the |parquet.filter2| package. We use
this in Spark SQL to enable Parquet filter push-down. Basically, you need
to convert your query predicate into a Parquet |FilterPredicate|, then set
it on the Hadoop configuration object via
|ParquetInputFormat.setFilterPredicate|. Here is how we do the predicate
conversion in Spark SQL:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetFilters.scala
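For reference, here is a minimal sketch of that workflow using the Java filter2 API. The column name "id" and the literal 42 are just placeholders, and the exact package names may differ depending on which Parquet version you are on:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

import parquet.filter2.predicate.FilterApi;
import parquet.filter2.predicate.FilterPredicate;
import parquet.filter2.predicate.Operators.IntColumn;
import parquet.hadoop.ParquetInputFormat;

public class ParquetFilterExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Build a predicate equivalent to "id = 42" on an INT32 column named "id".
    IntColumn id = FilterApi.intColumn("id");
    FilterPredicate predicate = FilterApi.eq(id, 42);

    // Register the predicate so ParquetInputFormat can use row group
    // statistics to skip non-matching row groups and drop non-matching
    // records while reading.
    ParquetInputFormat.setFilterPredicate(conf, predicate);

    Job job = Job.getInstance(conf, "parquet-filter-example");
    // ... set input paths and the rest of the job configuration as usual ...
  }
}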
Cheng
On 2/5/15 10:22 PM, Mohit Jaggi wrote:
Hi Parquet Developers,
I have a use case where I may repeatedly (but from different processes) “query” a
large Parquet file for specific rows. The query is a filter on one of the columns,
and that column is just an increasing integer (e.g. 1, 2, 3, 4…). If I naively use
predicate pushdown, the whole file will be scanned for every query, right? But there
is enough metadata to allow me to skip “pages” and “row groups” that don’t have
a match. Is there an API that I can use to skip over “row groups” and “pages” and
scan only the pages that have the row I am looking for? I saw references to
“metadata based predicate pushdown” and “indexes in parquet 2.0”, so I guess such
APIs do exist.
Thanks for your help,
Mohit.