When schema merging (mergeSchema) and the predicate filter are both enabled, queries fail because Parquet filters are pushed down regardless of the individual schema of each split (or rather, each file).
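For context, a minimal reproduction along the lines of the JIRA report would look roughly like this (the paths and column names are placeholders, and this assumes a spark-shell session where sqlContext and its implicits are in scope):

    import sqlContext.implicits._

    // Two part-files under the same root, written with different schemas.
    Seq((1, "a")).toDF("id", "name")
      .write.parquet("/tmp/table/part=1")
    Seq((2, "b", 10)).toDF("id", "name", "extra")
      .write.parquet("/tmp/table/part=2")

    // With schema merging enabled, filtering on the column that exists in
    // only one of the files pushes the predicate down to both files, and
    // reading the file without that column fails.
    sqlContext.read
      .option("mergeSchema", "true")
      .parquet("/tmp/table")
      .filter("extra > 5")
      .show()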
Dominic Ricard reported this issue (https://issues.apache.org/jira/browse/SPARK-11103). Although it can be worked around by setting spark.sql.parquet.filterPushdown to false, the default value of that option is true, so this looks like an issue.

My questions are: is this clearly an issue, and if so, how should it be handled? Assuming it is one, I made three rough patches and tested them, and they all look fine. The first approach looks the simplest and the most appropriate, judging from previous work such as https://issues.apache.org/jira/browse/SPARK-11153. However, in terms of safety and performance, I would like to confirm which one is the proper approach before opening a PR.

1. Simply disable spark.sql.parquet.filterPushdown when mergeSchema is enabled.

2. If spark.sql.parquet.filterPushdown is true, retrieve the schemas of all the part-files (as well as the merged one), check whether each of them can accept the given filter, and push the filter down only when all of them can. I think this is a bit over-engineered (a rough sketch of the underlying check follows this list).

3. If spark.sql.parquet.filterPushdown is true, retrieve the schemas of all the part-files (as well as the merged one) and push the filter down only to each split (or rather, file) whose schema can accept it. This ends up with different configurations for each task in a job, which I think is hacky.
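For illustration, the schema check behind approaches 2 and 3 could boil down to something like the standalone sketch below. The helper names (canAcceptFilter, shouldPushDown) and the idea of passing the referenced column names around are just assumptions for this sketch, not the actual internals of my patches:

    import org.apache.spark.sql.types.StructType

    // Hypothetical helper: a filter can only be pushed down to a part-file
    // whose schema contains every column the filter references.
    def canAcceptFilter(
        fileSchema: StructType,
        referencedColumns: Seq[String]): Boolean =
      referencedColumns.forall(fileSchema.fieldNames.contains)

    // Approach 2: push the filter down only when every part-file accepts it.
    def shouldPushDown(
        fileSchemas: Seq[StructType],
        referencedColumns: Seq[String]): Boolean =
      fileSchemas.forall(canAcceptFilter(_, referencedColumns))

Approach 3 would instead call canAcceptFilter per file and push the filter down only for the files that pass.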