Github user liancheng commented on the issue:

    https://github.com/apache/spark/pull/14649
  
    Sorry for the late reply.
    
    Firstly, Spark SQL only reads the footers of all Parquet files when schema merging is enabled, which is controlled by the SQL option `spark.sql.parquet.mergeSchema`. This is necessary because the schema of every individual physical Parquet file has to be examined to determine the global schema. When schema merging is disabled, which is the default, summary files (`_metadata` and/or `_common_metadata`) are used instead if there are any. If no summary files are available, Spark SQL just reads the footer of a single Parquet file to get the schema. So it seems that the first point mentioned in your PR description is not really a problem?
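    To make the behavior concrete, here is a minimal sketch (the `/data/events` path is hypothetical, and an active `SparkSession` is assumed):
    
    ```scala
    import org.apache.spark.sql.SparkSession
    
    val spark = SparkSession.builder().appName("merge-schema-demo").getOrCreate()
    
    // Schema merging disabled (the default): Spark SQL infers the schema from
    // summary files if present, otherwise from the footer of a single file.
    val single = spark.read.parquet("/data/events")
    
    // Schema merging enabled for this read: the footer of every file is read
    // and the per-file schemas are merged into one global schema.
    val merged = spark.read.option("mergeSchema", "true").parquet("/data/events")
    
    // The same behavior can also be toggled globally via the SQL option.
    spark.conf.set("spark.sql.parquet.mergeSchema", "true")
    ```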
    
    Secondly, although you mentioned "partition pruning", what the code change in this PR actually performs is Parquet row group filtering, which is already a feature of Spark SQL.
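    For reference, row group filtering is driven by the existing filter pushdown option, which is on by default. A minimal sketch (path and column name are hypothetical, reusing the `spark` session from above):
    
    ```scala
    import org.apache.spark.sql.functions.col
    
    // Filter pushdown to Parquet is already available and enabled by default.
    spark.conf.set("spark.sql.parquet.filterPushdown", "true")
    
    // A predicate like this is pushed down to the Parquet reader, which can
    // skip entire row groups whose column statistics rule out matching rows.
    spark.read.parquet("/data/events")
      .filter(col("id") > 1000)
      .show()
    ```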
    
    Thirdly, partition pruning is already implemented in Spark SQL. 
Furthermore, since partition pruning is handled inside the framework of Spark 
SQL, not only data source filters, but also arbitrary Catalyst expressions can 
be used to prune partitions.
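    For example (paths and column names are hypothetical, and `df` is assumed to be an existing DataFrame), partition pruning operates on the directory layout before any Parquet file is even opened, and accepts arbitrary Catalyst expressions:
    
    ```scala
    import org.apache.spark.sql.functions.col
    
    // Write data partitioned by year and month; partition values are encoded
    // in the directory structure, e.g. /data/logs/year=2016/month=8/.
    df.write.partitionBy("year", "month").parquet("/data/logs")
    
    // A predicate over partition columns prunes whole directories. Since
    // pruning happens inside Spark SQL's planner, an arbitrary Catalyst
    // expression works here, not just a simple data source filter.
    spark.read.parquet("/data/logs")
      .filter(col("year") * 100 + col("month") >= 201601)
      .show()
    ```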
    
    Given all that, I don't see what benefits this PR brings. Did I miss something here?


