Github user liancheng commented on the issue: https://github.com/apache/spark/pull/14649 Sorry for the late reply. Firstly, Spark SQL only reads the footers of all Parquet files when schema merging is enabled, which is controlled by the SQL option `spark.sql.parquet.mergeSchema`. This is necessary because the schema of every individual physical Parquet file must be read to determine the merged global schema. When schema merging is disabled, which is the default, summary files (`_metadata` and/or `_common_metadata`) are still used if there are any. If no summary files are available, Spark SQL simply reads the footer of a single Parquet file to get the schema. So it seems that the first point mentioned in your PR description is not really a problem? Secondly, although you mentioned "partition pruning", what the code change in this PR actually performs is Parquet row group filtering, which is already a feature of Spark SQL. Thirdly, partition pruning is already implemented in Spark SQL. Furthermore, since partition pruning is handled inside the Spark SQL framework, not only data source filters but also arbitrary Catalyst expressions can be used to prune partitions. That said, I don't see the benefits of this PR. Did I miss something here?
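To make the distinction concrete, here is a minimal sketch of the three features mentioned above as they appear from the user side. The paths and the `spark` session are hypothetical placeholders; this is an illustration of existing Spark SQL behavior, not code from the PR:

```scala
import org.apache.spark.sql.functions.col

// Assumes an existing SparkSession named `spark` and a hypothetical
// Parquet dataset at /data/events partitioned by a `date` column.

// 1. Schema merging: disabled by default. Reading the footers of *all*
//    Parquet files only happens when it is turned on, either globally via
//    spark.sql.parquet.mergeSchema or per read:
val merged = spark.read
  .option("mergeSchema", "true")
  .parquet("/data/events")

// 2. Partition pruning: already handled inside Spark SQL's planner. A filter
//    on the partition column means only matching partition directories are
//    ever listed and scanned:
val pruned = spark.read
  .parquet("/data/events")
  .filter(col("date") === "2016-08-15")

// 3. Parquet row group filtering (what this PR's change actually does) is a
//    separate, existing feature: filter pushdown into the Parquet reader,
//    controlled by spark.sql.parquet.filterPushdown.
```

Note that (2) and (3) are independent: partition pruning skips whole directories before any file is opened, while row group filtering uses Parquet statistics to skip row groups inside files that are read.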