GitHub user liancheng opened a pull request: https://github.com/apache/spark/pull/1370
[SPARK-2119][SQL] Improved Parquet performance when reading off S3

JIRA issue: [SPARK-2119](https://issues.apache.org/jira/browse/SPARK-2119)

Essentially this PR fixes three issues to gain much better performance when reading large Parquet files off S3.

1. When reading the schema, fetch Parquet metadata from a part-file rather than the `_metadata` file

   The `_metadata` file contains metadata of all row groups, and can be very large if there are many row groups. Since schema information and row group metadata are coupled within a single Thrift object, we have to read the whole `_metadata` file to fetch the schema. On the other hand, the schema is replicated among the footers of all part-files, which are fairly small.

1. Only add the root directory of the Parquet file rather than all the part-files to input paths

   The HDFS API automatically filters out all hidden files and underscore files (`_SUCCESS` & `_metadata`), so there's no need to pick out the part-files and add them individually to the input paths. What makes it much worse is that `FileInputFormat.listStatus()` calls `FileSystem.globStatus()` on each individual input path sequentially, and each call results in a blocking remote S3 HTTP request.

1. Worked around [PARQUET-16](https://issues.apache.org/jira/browse/PARQUET-16)

   Essentially PARQUET-16 is similar to the above issue, and results in lots of sequential `FileSystem.getFileStatus()` calls, which are further translated into a bunch of remote S3 HTTP requests.

   `FilteringParquetRowInputFormat` should be cleaned up once PARQUET-16 is fixed.
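To make the cost argument in the second item concrete, here is a minimal sketch in Python. It is a toy model, not the Hadoop or Spark code: `CountingStore`, `is_visible`, the two `inputs_*` helpers, and the `s3n://example-bucket/...` path are all hypothetical names made up for illustration. The only behaviour borrowed from the description above is that hidden and underscore files are skipped and that every per-path lookup costs one blocking remote request.

```python
class CountingStore:
    """Toy stand-in for a remote object store (hypothetical, not a Hadoop class).

    Every method call is counted as one blocking remote request.
    """

    def __init__(self, directories):
        self.directories = directories  # maps directory path -> list of file names
        self.requests = 0

    def list_dir(self, directory):
        # One request returns every entry under the directory.
        self.requests += 1
        return [directory + "/" + name for name in self.directories[directory]]

    def file_status(self, path):
        # One request per individual path.
        self.requests += 1
        return path


def is_visible(path):
    """Skip hidden and underscore files such as _SUCCESS and _metadata."""
    name = path.rsplit("/", 1)[-1]
    return not name.startswith(("_", "."))


def inputs_per_part_file(store, root):
    """Old approach: discover the part-files, then look each one up in turn."""
    part_files = [p for p in store.list_dir(root) if is_visible(p)]
    return [store.file_status(p) for p in part_files]


def inputs_from_root(store, root):
    """New approach: hand over the root directory and filter one listing."""
    return [p for p in store.list_dir(root) if is_visible(p)]


if __name__ == "__main__":
    root = "s3n://example-bucket/events.parquet"  # hypothetical location
    names = ["_SUCCESS", "_metadata"] + ["part-r-%05d.parquet" % i for i in range(100)]

    old_store = CountingStore({root: names})
    new_store = CountingStore({root: names})

    inputs_per_part_file(old_store, root)
    inputs_from_root(new_store, root)

    print("per part-file requests:", old_store.requests)
    print("root directory requests:", new_store.requests)
```

In this model the per-part-file approach issues one request per part-file on top of the initial listing, while the root-directory approach issues a single listing request. Both return the same set of input files, so the difference is purely in the number of sequential round trips.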
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/liancheng/spark faster-parquet

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1370.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1370

----
commit 5bd3d29f9fa118719c94d1f5acffa24d6f1a755d
Author: Cheng Lian <lian.cs....@gmail.com>
Date:   2014-07-06T04:59:01Z

    Fixed Parquet log level

commit 1c0d1b923a57fddd1fe67270c71e28ac0324de04
Author: Cheng Lian <lian.cs....@gmail.com>
Date:   2014-07-09T01:53:38Z

    Accelerated Parquet schema retrieving

commit d2c4417a45dff48ad52a830695f9d68f9ed8531f
Author: Cheng Lian <lian.cs....@gmail.com>
Date:   2014-07-10T20:17:57Z

    Worked around PARQUET-16 to improve Parquet performance

----