[GitHub] spark pull request: [SPARK-8125] [SQL] Accelerates Parquet schema ...

liancheng Tue, 14 Jul 2015 08:12:16 -0700

GitHub user liancheng opened a pull request:

    https://github.com/apache/spark/pull/7396


    [SPARK-8125] [SQL] Accelerates Parquet schema merging and partition discovery

    This PR tries to accelerate Parquet schema discovery and `HadoopFsRelation` 
partition discovery.  The acceleration is done by the following means:
    
    - Turning off schema merging by default
    
      Schema merging is not the most common case, but it requires reading the 
footers of all Parquet part-files and can be very slow.
    
    - Avoiding `FileSystem.globStatus()` call when possible
    
      `FileSystem.globStatus()` may issue multiple synchronous RPC calls and 
can be very slow (especially on S3).  This PR adds 
`SparkHadoopUtil.globPathIfNecessary()`, which only issues RPC calls when the 
path contains glob-pattern-specific characters (`{}[]*?\`).
    
    - Listing leaf files in parallel when the number of input paths exceeds a 
threshold
    
      Listing leaf files is required by partition discovery.  Currently it is 
done on the driver side, and can be slow when there are many (nested) 
directories, since each `FileSystem.listStatus()` call issues an RPC.  In this 
PR, we list leaf files in BFS order, and fall back to a Spark job once the 
number of directories that need to be listed exceeds a threshold.
    
      The threshold is controlled by `SQLConf` option 
`spark.sql.sources.parallelPartitionDiscovery.threshold`, which defaults to 32.
    
    - Discovering Parquet schema in parallel
    
      Currently, schema merging is also done on the driver side, and needs to 
read the footers of all part-files.  This PR uses a Spark job to do schema 
merging.  Together with task-side metadata reading in Parquet 1.7.0, the driver 
no longer reads any footers.
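
    For readers who want the shape of the glob check and the BFS listing 
without opening the diff, here is a rough, self-contained Python sketch.  It is 
not the code in this PR: the function names are hypothetical, the local 
filesystem stands in for HDFS/S3, `expand` stands in for 
`FileSystem.globStatus()`, and a thread pool stands in for the Spark job.  It 
also re-checks the threshold at every BFS level, which may differ from what the 
PR does once the threshold is crossed.

```python
import os
from concurrent.futures import ThreadPoolExecutor

# Characters that mark a path as a glob pattern, as listed in the PR description.
GLOB_CHARS = set("{}[]*?\\")


def is_glob_path(path):
    """Return True if the path contains any glob-pattern character."""
    return any(ch in GLOB_CHARS for ch in path)


def glob_path_if_necessary(path, expand):
    """Call the expensive `expand` only when the path looks like a glob.

    `expand` is a hypothetical callable standing in for the RPC-heavy
    FileSystem.globStatus() call; plain paths are returned untouched.
    """
    return expand(path) if is_glob_path(path) else [path]


def list_dir(directory):
    """List one directory and return (files, subdirectories)."""
    files, dirs = [], []
    with os.scandir(directory) as entries:
        for entry in entries:
            (dirs if entry.is_dir() else files).append(entry.path)
    return files, dirs


def list_leaf_files(roots, threshold=32):
    """List all files under `roots` breadth-first.

    A level is listed serially while it holds at most `threshold`
    directories. A larger level is listed by a thread pool, which is only
    a local stand-in (an assumption of this sketch) for the Spark job.
    """
    leaf_files, frontier = [], list(roots)
    while frontier:
        if len(frontier) > threshold:
            with ThreadPoolExecutor() as pool:
                results = list(pool.map(list_dir, frontier))
        else:
            results = [list_dir(d) for d in frontier]
        frontier = []
        for files, dirs in results:
            leaf_files.extend(files)
            frontier.extend(dirs)
    return sorted(leaf_files)
```

    With this sketch, `glob_path_if_necessary("/data/table", expand)` returns 
`["/data/table"]` without ever calling `expand`, while a path such as 
`/data/part-*` is passed through to it.  The default of 32 mirrors the default 
of `spark.sql.sources.parallelPartitionDiscovery.threshold` mentioned above.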

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/liancheng/spark accel-parquet

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/7396.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #7396
    
----
commit e2d07af20d1409eaa5d49235e3255e0e99f6502c
Author: Cheng Lian <l...@databricks.com>
Date:   2015-07-01T23:32:44Z

    Moves schema merging to executor side
    
    Removes some dead code
    
    Parallelizes input paths listing

----

