GitHub user raofu opened a pull request:

    https://github.com/apache/spark/pull/22157

    [SPARK-25126] Avoid creating Reader for all orc files

    In OrcFileOperator.readSchema, a Reader is created for every file
    although only the first valid one is used. This uses a significant
    amount of memory when `paths` contains a lot of files. In 2.3,
    a different code path, OrcUtils.readSchema, is used for inferring
    the schema for orc files. This commit changes both functions to
    create the Reader lazily.
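
The difference between eager and lazy Reader creation described above can be sketched roughly as follows. This is an illustration in Python, not the Scala code touched by the patch: `FakeReader`, `FILES`, `open_fake`, and both `read_schema_*` functions are hypothetical names, and a file with no usable schema is modeled as `None`, which is an assumption about what "valid" means here.

```python
from typing import Callable, Iterable, Optional


class FakeReader:
    """Hypothetical stand-in for an ORC Reader; this is not the real ORC API."""

    opened = 0  # counts how many readers have been created so far

    def __init__(self, path: str, schema: Optional[str]) -> None:
        FakeReader.opened += 1
        self.path = path
        self.schema = schema  # None models a file without a usable schema


# Hypothetical file listing: path -> schema stored in that file.
FILES = {
    "part-0.orc": None,
    "part-1.orc": "struct<id:int>",
    "part-2.orc": "struct<id:int>",
    "part-3.orc": "struct<id:int>",
}


def open_fake(path: str) -> FakeReader:
    return FakeReader(path, FILES[path])


def read_schema_eager(
    paths: Iterable[str], open_reader: Callable[[str], FakeReader]
) -> Optional[str]:
    # Eager: the list comprehension creates a reader for every path
    # before any of them is inspected.
    readers = [open_reader(p) for p in paths]
    for reader in readers:
        if reader.schema is not None:
            return reader.schema
    return None


def read_schema_lazy(
    paths: Iterable[str], open_reader: Callable[[str], FakeReader]
) -> Optional[str]:
    # Lazy: the generator creates readers one at a time, and next()
    # stops pulling from it as soon as a schema is found.
    readers = (open_reader(p) for p in paths)
    return next((r.schema for r in readers if r.schema is not None), None)
```

With the four hypothetical files above, the eager version creates four readers while the lazy version creates two, stopping at `part-1.orc`. Both return the same schema.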
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/raofu/spark SPARK-25126

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22157.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22157
    
----
commit 5a86b3618da695431c01ddbe4bb102a45f93b3b1
Author: Rao Fu <rao@...>
Date:   2018-08-17T23:40:05Z

    [SPARK-25126] Avoid creating Reader for all orc files
    
    In OrcFileOperator.readSchema, a Reader is created for every file
    although only the first valid one is used. This uses a significant
    amount of memory when `paths` contains a lot of files. In 2.3,
    a different code path, OrcUtils.readSchema, is used for inferring
    the schema for orc files. This commit changes both functions to
    create the Reader lazily.

----

