Hi, I just want to confirm my understanding of the physical plan that Spark SQL generates when reading from a Parquet file.
When multiple predicates are pushed down to the scan (via `PrunedFilteredScan`), does Spark ensure that the Parquet file is not read multiple times while evaluating each predicate? In general, is this optimization done by all columnar databases and file formats?

I ran the following query in the spark-shell:

```scala
val nameDF = sqlContext.sql("SELECT name FROM parquetFile WHERE age = 50 AND name = 'someone'")
```

I saw that both filters are pushed down, but I can't seem to find where they are actually applied to the file data. `nameDF.explain()` shows:

```
Project [name#112]
+- Filter ((age#111L = 50) && (name#112 = someone))
   +- Scan ParquetRelation[name#112,age#111L] InputPaths: file:/home/spark/spark-1.6.1/people.parquet, PushedFilters: [EqualTo(age,50), EqualTo(name,someone)]
```
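In case it helps to reproduce, here is a minimal sketch of my setup in the spark-shell (Spark 1.6.x). The sample rows and the `overwrite` mode are just assumptions for illustration; the path and table name match my example above:

```scala
// spark-shell provides sqlContext; the implicits enable .toDF on Seq.
import sqlContext.implicits._

// Write a small Parquet file with the same schema as in the question.
// (The sample rows are hypothetical; overwrite in case the path exists.)
Seq(("someone", 50L), ("other", 30L))
  .toDF("name", "age")
  .write.mode("overwrite")
  .parquet("/home/spark/spark-1.6.1/people.parquet")

// Register the file as a temp table so it can be queried with SQL.
sqlContext.read.parquet("/home/spark/spark-1.6.1/people.parquet")
  .registerTempTable("parquetFile")

val nameDF = sqlContext.sql(
  "SELECT name FROM parquetFile WHERE age = 50 AND name = 'someone'")

// explain(true) prints the parsed, analyzed, optimized, and physical
// plans, so you can see the PushedFilters on the scan alongside the
// Filter node that remains in the physical plan.
nameDF.explain(true)
```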