Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/5298#issuecomment-89631527

Ah, I'm also considering similar optimizations for Spark 1.4 :)

The tricky part here is that, when scanning a Parquet table, Spark needs to call `ParquetInputFormat.getSplits` to compute the (Spark) partition information. This `getSplits` call can be super expensive because it needs to read the footers of all Parquet part-files to compute the Parquet splits. That's why `ParquetRelation2` caches those footers at the very beginning and injects them into an extended Parquet input format. With all the footers cached, `ParquetRelation2.readSchema()` is actually quite lightweight. So the real bottleneck is reading all those footers.

Fortunately, Parquet is also trying to avoid reading footers on the driver side entirely (see https://github.com/apache/incubator-parquet-mr/pull/91 and https://github.com/apache/incubator-parquet-mr/pull/45). After upgrading to Parquet 1.6, which is expected to be released next week, we can do this properly for better performance. So ideally, we don't read footers on the driver side at all, and when we have a central authoritative schema at hand, either from the metastore or from data source DDL, we don't do schema merging on the driver side either.

I haven't had time to walk through all the related Parquet code paths and PRs yet, so the statements above may be inaccurate. Please correct me if you find any mistakes.
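To illustrate the idea being discussed, here is a minimal, hypothetical sketch (not Spark's actual implementation; all names are made up) of the footer-caching pattern: each part-file's footer is read from storage at most once, and the cached metadata is then reused for both split planning and schema merging, so neither operation pays the I/O cost twice.

```python
class FooterCache:
    """Hypothetical sketch: cache Parquet footers so split computation
    and schema merging both reuse one read per part-file."""

    def __init__(self, read_footer):
        # read_footer stands in for the expensive per-file footer I/O.
        self._read_footer = read_footer
        self._cache = {}
        self.reads = 0  # counts actual footer reads, for illustration

    def footer(self, path):
        # Read the footer from storage only on the first request.
        if path not in self._cache:
            self._cache[path] = self._read_footer(path)
            self.reads += 1
        return self._cache[path]

    def compute_splits(self, paths):
        # Split planning needs row-group metadata from every footer.
        return [(p, rg) for p in paths for rg in self.footer(p)["row_groups"]]

    def merged_schema(self, paths):
        # Schema merging becomes cheap once footers are cached.
        merged = set()
        for p in paths:
            merged |= set(self.footer(p)["schema"])
        return sorted(merged)
```

With this shape, calling `compute_splits` and then `merged_schema` over the same part-files performs exactly one footer read per file instead of one per operation, which is the lightweight-`readSchema()` behavior described above.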