Github user andreweduffy commented on the issue:

    https://github.com/apache/spark/pull/14649
  
    Hyun mostly sums it up. This uses Parquet's summary metadata when it is 
available: rather than performing row-group-level filtering, it filters out 
entire files. It does this while constructing the FileScanRDD, which means 
tasks are only spawned for files that can match the predicate. At work we were 
running into issues where very large S3 datasets took exceedingly long to load 
in Spark. We're running this exact patch in production, and empirically, for 
many types of queries we see a large decrease in both the number of tasks 
created and the time spent fetching from S3. So this mainly helps the use case 
of short-lived RDDs (where doing .persist doesn't help you) that are backed by 
data in S3 (where eliminating read time is actually a significant speedup).
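    To illustrate the idea (not the actual patch), file-level pruning can be 
sketched as: model each file by the column min/max stats recorded in the 
summary metadata, and drop any file whose range cannot satisfy the predicate 
before any task is created for it. The `FileFooter` type and `prune` function 
below are hypothetical names for illustration only.

```scala
// Hypothetical model of a Parquet file footer: just the file path plus
// the min/max statistics for one column, as recorded in summary metadata.
case class FileFooter(path: String, min: Long, max: Long)

object FilePruning {
  // Keep a file only if its [min, max] range can contain a row matching
  // the predicate "col > threshold"; files entirely at or below the
  // threshold are skipped, so no task is ever spawned for them.
  def prune(files: Seq[FileFooter], threshold: Long): Seq[FileFooter] =
    files.filter(_.max > threshold)

  def main(args: Array[String]): Unit = {
    val files = Seq(
      FileFooter("part-00000.parquet", 0L, 99L),
      FileFooter("part-00001.parquet", 100L, 199L),
      FileFooter("part-00002.parquet", 200L, 299L)
    )
    // Predicate: col > 150 — only the last two files can contain matches,
    // so only two tasks (instead of three) would be scheduled.
    val kept = prune(files, 150L)
    println(kept.map(_.path).mkString(","))
  }
}
```

    The real patch does this inside FileScanRDD construction, so the pruning 
happens before task scheduling rather than inside each task.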
