GitHub user liancheng opened a pull request:

    https://github.com/apache/spark/pull/1370

    [SPARK-2119][SQL] Improved Parquet performance when reading off S3

    JIRA issue: [SPARK-2119](https://issues.apache.org/jira/browse/SPARK-2119)
    
    Essentially this PR fixes three issues to gain much better performance when 
reading large Parquet files off S3.
    
    1. When reading the schema, fetch Parquet metadata from a part-file rather 
than from the `_metadata` file
    
       The `_metadata` file contains the metadata of all row groups and can be 
very large when there are many row groups. Since schema information and row 
group metadata are coupled within a single Thrift object, we have to read the 
whole `_metadata` file just to fetch the schema. The schema, on the other hand, 
is replicated in the footers of all part-files, which are fairly small.
    
    1. Add only the root directory of the Parquet file, rather than all the 
part-files, to the input paths
    
       The HDFS API automatically filters out hidden files and underscore files 
(`_SUCCESS` & `_metadata`), so there's no need to filter the part-files 
ourselves and add them individually to the input paths. What makes it much 
worse is that `FileInputFormat.listStatus()` calls `FileSystem.globStatus()` on 
each individual input path sequentially, and each call results in a blocking 
remote S3 HTTP request.
    
    1. Worked around 
[PARQUET-16](https://issues.apache.org/jira/browse/PARQUET-16)
    
       Essentially PARQUET-16 is similar to the issue above: it results in lots 
of sequential `FileSystem.getFileStatus()` calls, which are in turn translated 
into a bunch of remote S3 HTTP requests.
    
       `FilteringParquetRowInputFormat` should be cleaned up once PARQUET-16 is 
fixed.
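
    To see why the second and third fixes matter so much on S3, here is a 
minimal, self-contained Scala sketch (not Spark's actual code) of the request 
arithmetic: issuing one blocking metadata request per part-file scales linearly 
with the number of part-files, while listing the root directory costs a single 
round trip. The `latencyMs` figure is an illustrative assumption, not a 
measured S3 latency.

    ```scala
    // Hypothetical illustration of the cost model behind this PR's fixes.
    // Every name here is made up for the example; only the shape of the
    // argument (N sequential round trips vs. one listing) comes from the PR.
    object S3RequestCost {
      // Assumed per-request S3 round-trip latency, in milliseconds.
      val latencyMs: Long = 50

      // Old behavior: one globStatus()/getFileStatus() call per part-file,
      // executed sequentially, so total cost grows linearly with N.
      def perPartFileCostMs(numPartFiles: Int): Long =
        numPartFiles.toLong * latencyMs

      // New behavior: a single listing request against the root directory.
      def rootDirCostMs: Long = latencyMs

      def main(args: Array[String]): Unit = {
        val parts = 2000
        println(s"per part-file: ${perPartFileCostMs(parts)} ms")
        println(s"root dir only: ${rootDirCostMs} ms")
      }
    }
    ```

    With 2000 part-files and the assumed 50 ms round trip, the sequential 
scheme spends 100 seconds on metadata requests alone, versus a single round 
trip for the directory listing; the same reasoning applies to reading the 
schema from one small part-file footer instead of the full `_metadata` file.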

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/liancheng/spark faster-parquet

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1370.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1370
    
----
commit 5bd3d29f9fa118719c94d1f5acffa24d6f1a755d
Author: Cheng Lian <lian.cs....@gmail.com>
Date:   2014-07-06T04:59:01Z

    Fixed Parquet log level

commit 1c0d1b923a57fddd1fe67270c71e28ac0324de04
Author: Cheng Lian <lian.cs....@gmail.com>
Date:   2014-07-09T01:53:38Z

    Accelerated Parquet schema retrieving

commit d2c4417a45dff48ad52a830695f9d68f9ed8531f
Author: Cheng Lian <lian.cs....@gmail.com>
Date:   2014-07-10T20:17:57Z

    Worked around PARQUET-16 to improve Parquet performance

----

