GitHub user jerryshao opened a pull request:

    https://github.com/apache/spark/pull/19476

    [SPARK-22062][CORE] Spill large block to disk in BlockManager's remote 
fetch to avoid OOM

    ## What changes were proposed in this pull request?
    
    In the current `BlockManager#getRemoteBytes`, Spark calls 
`BlockTransferService#fetchBlockSync` to fetch the remote block. 
`fetchBlockSync` allocates a temporary `ByteBuffer` to hold the entire fetched 
block, which can lead to OOM if the block is too large or if several blocks 
are fetched simultaneously in the same executor.
    
    So, borrowing the idea from shuffle fetch, this PR spills large blocks to 
local disk before they are consumed by upstream code. The behavior is 
controlled by a newly added configuration: if the block size is smaller than 
the threshold, the block is kept in memory; otherwise it is first spilled to 
disk and then read back from the disk file.
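    The threshold decision described above can be sketched as follows. This 
is a simplified illustration, not Spark's actual code; the constant name and 
the 200 MB default are assumptions for the example.

    ```java
    // Hypothetical sketch of the fetch-to-memory threshold check.
    // FETCH_TO_MEM_THRESHOLD stands in for the newly added configuration value.
    public class FetchSketch {
        static final long FETCH_TO_MEM_THRESHOLD = 200L * 1024 * 1024; // assumed default

        // Returns where the fetched block's bytes would be materialized.
        static String destinationFor(long blockSize) {
            return blockSize < FETCH_TO_MEM_THRESHOLD ? "memory" : "disk";
        }

        public static void main(String[] args) {
            System.out.println(destinationFor(1024));               // small block -> memory
            System.out.println(destinationFor(500L * 1024 * 1024)); // large block -> disk
        }
    }
    ```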
    
    To achieve this feature, this PR:
    
    1. Renames `TempShuffleFileManager` to `TempFileManager`, since it is no 
longer used only by shuffle.
    2. Adds a new `TempFileManager` to manage the files of fetched remote 
blocks; the files are tracked by weak references and deleted once they are no 
longer referenced.
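    The weak-reference tracking in step 2 can be sketched with a 
`WeakReference` plus `ReferenceQueue`: once the buffer owning a temp file is 
garbage collected, the file becomes eligible for deletion. This is an 
illustrative sketch under assumed names, not the PR's actual `TempFileManager` 
implementation.

    ```java
    import java.io.File;
    import java.lang.ref.Reference;
    import java.lang.ref.ReferenceQueue;
    import java.lang.ref.WeakReference;
    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical tracker: each buffer is weakly referenced; when the GC
    // enqueues a reference, the backing temp file is deleted.
    public class TempFileTracker {
        private final ReferenceQueue<Object> queue = new ReferenceQueue<>();
        private final Map<WeakReference<Object>, File> files = new HashMap<>();

        void track(Object buffer, File file) {
            files.put(new WeakReference<>(buffer, queue), file);
        }

        // Poll the queue and delete files whose owning buffers were collected.
        int cleanUp() {
            int deleted = 0;
            Reference<?> ref;
            while ((ref = queue.poll()) != null) {
                File f = files.remove(ref);
                if (f != null && f.delete()) deleted++;
            }
            return deleted;
        }

        public static void main(String[] args) throws Exception {
            TempFileTracker tracker = new TempFileTracker();
            File f = File.createTempFile("fetch", ".tmp");
            Object buffer = new Object();
            tracker.track(buffer, f);
            // While the buffer is strongly reachable, nothing is cleaned up.
            System.out.println(tracker.cleanUp());
            System.out.println(f.exists());
            f.delete();
        }
    }
    ```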
    
    ## How was this patch tested?
    
    This was tested by adding unit tests, plus manual verification in a local 
test that triggers GC to clean up the files.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jerryshao/apache-spark SPARK-22062

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19476.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19476
    
----
commit f50a7b75c303bd2cf261dfb1b4fe74fa5498ca4b
Author: jerryshao <ss...@hortonworks.com>
Date:   2017-10-12T01:47:35Z

    Spill large blocks to disk during remote fetches in BlockManager

----


---
