Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/5298#issuecomment-89631527
  
    Ah, I'm also considering similar optimizations for Spark 1.4 :)
    
    The tricky part here is that, when scanning a Parquet table, Spark needs 
to call `ParquetInputFormat.getSplits` to compute (Spark) partition 
information. This `getSplits` call can be very expensive because it needs to 
read the footers of all Parquet part-files to compute the Parquet splits. 
That's why `ParquetRelation2` caches those footers at the very beginning and 
injects them into an extended Parquet input format. With all these footers 
cached, `ParquetRelation2.readSchema()` is actually quite lightweight. So the 
real bottleneck is reading all those footers.
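    
    To make that cost concrete, here is a rough sketch (not Spark's actual 
code) of what the footer scan amounts to, using the parquet-mr 
`ParquetFileReader` helper; the method name `readAndCacheFooters` is just for 
illustration:
    
    ```scala
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.FileStatus
    import parquet.hadoop.{Footer, ParquetFileReader}

    import scala.collection.JavaConverters._

    // Illustration only: one footer read per part-file, all on the driver,
    // before a single split can be computed. For tables with thousands of
    // part-files this dominates planning time, which is why ParquetRelation2
    // caches the result up front.
    def readAndCacheFooters(conf: Configuration, partFiles: Seq[FileStatus]): Seq[Footer] =
      ParquetFileReader.readAllFootersInParallel(conf, partFiles.asJava).asScala
    ```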
    
    Fortunately, Parquet is also working on avoiding footer reads on the 
driver side entirely (see https://github.com/apache/incubator-parquet-mr/pull/91 
and https://github.com/apache/incubator-parquet-mr/pull/45). After upgrading to 
Parquet 1.6, which is expected to be released next week, we can do this 
properly for better performance.
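    
    If I read those two PRs correctly, split computation can be pushed to the 
task side by enabling task-side metadata, so the driver no longer touches the 
footers at all. A minimal sketch, assuming the `parquet.task.side.metadata` 
flag from those PRs keeps its current name in the 1.6 release:
    
    ```scala
    import org.apache.hadoop.conf.Configuration

    // Hedged sketch: "parquet.task.side.metadata" is the key used by the
    // task-side metadata work linked above. If it survives into the Parquet
    // 1.6 release unchanged, enabling it lets executors read row-group
    // metadata themselves instead of the driver reading every footer up front.
    val hadoopConf = new Configuration()
    hadoopConf.setBoolean("parquet.task.side.metadata", true)
    ```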
    
    So ideally, we don't read footers on the driver side at all, and when we 
have a central authoritative schema at hand, either from the metastore or from 
data source DDL, we don't do schema merging on the driver side either. I 
haven't had time to walk through all the related Parquet code paths and PRs 
yet, so the above statements may be inaccurate. Please correct me if you find 
any mistakes.
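    
    For example (just a sketch of the idea against the 1.3 data source API; 
the exact `load()` overload may differ), when the user already supplies a 
schema we could treat it as the authoritative one and skip merging:
    
    ```scala
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.types._

    // Sketch of the idea: a schema supplied through data source DDL or the
    // API can serve as the central authoritative schema, so no driver-side
    // footer reads or schema merging should be necessary.
    def loadWithKnownSchema(sqlContext: SQLContext, path: String) = {
      val schema = StructType(Seq(
        StructField("id", LongType),
        StructField("event", StringType)))
      sqlContext.load("org.apache.spark.sql.parquet", schema, Map("path" -> path))
    }
    ```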

