Hi, I just want to confirm my understanding of the physical plan that Spark SQL generates when reading from a Parquet file.
When multiple predicates are pushed down to the scan (via `PrunedFilteredScan`), does Spark ensure that the Parquet file is not read multiple times while evaluating each predicate? In general, is this optimization done by all columnar databases and file formats?

I ran the following query in the spark-shell:

```scala
val nameDF = sqlContext.sql("SELECT name FROM parquetFile WHERE age = 50 AND name = 'someone'")
```

I saw that both filters are pushed down, but I can't seem to find where they are actually applied to the file data. `nameDF.explain()` shows:

```
Project [name#112]
+- Filter ((age#111L = 50) && (name#112 = someone))
   +- Scan ParquetRelation[name#112,age#111L] InputPaths: file:/home/spark/spark-1.6.1/people.parquet, PushedFilters: [EqualTo(age,50), EqualTo(name,someone)]
```
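In case it helps to reproduce, here is a minimal sketch of my setup in the spark-shell (Spark 1.6.x). The sample rows and the `overwrite` mode are just assumptions for illustration; the path and table name match my example above:

```scala
// spark-shell provides sqlContext; the implicits enable .toDF on Seq.
import sqlContext.implicits._

// Write a small Parquet file with the same schema as in the question.
// (The sample rows are hypothetical; overwrite in case the path exists.)
Seq(("someone", 50L), ("other", 30L))
  .toDF("name", "age")
  .write.mode("overwrite")
  .parquet("/home/spark/spark-1.6.1/people.parquet")

// Register the file as a temp table so it can be queried with SQL.
sqlContext.read.parquet("/home/spark/spark-1.6.1/people.parquet")
  .registerTempTable("parquetFile")

val nameDF = sqlContext.sql(
  "SELECT name FROM parquetFile WHERE age = 50 AND name = 'someone'")

// explain(true) prints the parsed, analyzed, optimized, and physical
// plans, so you can see the PushedFilters on the scan alongside the
// Filter node that remains in the physical plan.
nameDF.explain(true)
```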