GitHub user squito opened a pull request: https://github.com/apache/spark/pull/21440
[SPARK-24307][CORE] Support reading remote cached partitions > 2gb

(1) Netty's ByteBuf cannot hold more than 2GB of data, so to transfer data from a ChunkedByteBuffer over the network we use a custom FileRegion implementation that is backed by the ChunkedByteBuffer.

(2) On the receiving end, we need to expose all the data in a FileSegmentManagedBuffer as a ChunkedByteBuffer. We do that by memory-mapping the entire file in chunks.

Added unit tests; the randomized test was run a couple of hundred times on my laptop. The tests cover the equivalent of SPARK-24107 for the ChunkedByteBufferFileRegion. Also tested on a cluster with remote cache reads > 2GB (both in memory and on disk).

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/squito/spark chunked_bb_file_region

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21440.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #21440

----

commit 4373e27c2ec96b77a2311f5c5997ae5ca84bf6c5
Author: Imran Rashid <irashid@...>
Date: 2018-05-23T03:59:40Z

    [SPARK-24307][CORE] Support reading remote cached partitions > 2gb

    (1) Netty's ByteBuf cannot support data > 2gb. So to transfer data from a
    ChunkedByteBuffer over the network, we use a custom version of FileRegion
    which is backed by the ChunkedByteBuffer.

    (2) On the receiving end, we need to expose all the data in a
    FileSegmentManagedBuffer as a ChunkedByteBuffer. We do that by memory
    mapping the entire file in chunks.

    Added unit tests. Also tested on a cluster with remote cache reads > 2gb
    (in memory and on disk).

----
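For readers unfamiliar with point (1): Netty's FileRegion contract lets the sender push data to a channel incrementally, resuming wherever the previous write stopped, which is what makes a > 2GB transfer possible without ever materializing one giant buffer. Spark's actual implementation is the Scala class ChunkedByteBufferFileRegion in the PR; the sketch below is a simplified, pure-JDK Java illustration of the resumable-write idea over a list of chunks. The class name ChunkedTransfer and its shape are hypothetical, not Spark's API, and it omits FileRegion details such as the position argument and reference counting.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.WritableByteChannel;

public class ChunkedTransfer {
    private final ByteBuffer[] chunks; // the data, split into < 2GB pieces
    private long transferred = 0;      // total bytes written so far
    private int currentChunk = 0;      // index of the chunk we are draining

    public ChunkedTransfer(ByteBuffer[] chunks) {
        this.chunks = chunks;
    }

    public long transferred() {
        return transferred;
    }

    // Write as much as the channel will accept, resuming from where the
    // previous call stopped. Returns the number of bytes written this call.
    public long transferTo(WritableByteChannel target) throws IOException {
        long written = 0;
        while (currentChunk < chunks.length) {
            ByteBuffer chunk = chunks[currentChunk];
            if (!chunk.hasRemaining()) {
                currentChunk++; // this chunk is drained, move to the next
                continue;
            }
            int n = target.write(chunk);
            if (n == 0) {
                break; // channel not ready; the caller retries later
            }
            written += n;
        }
        transferred += written;
        return written;
    }

    public static void main(String[] args) throws IOException {
        ByteBuffer a = ByteBuffer.wrap(new byte[]{1, 2, 3});
        ByteBuffer b = ByteBuffer.wrap(new byte[]{4, 5});
        ChunkedTransfer region = new ChunkedTransfer(new ByteBuffer[]{a, b});

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        region.transferTo(Channels.newChannel(out));
        System.out.println(region.transferred()); // 5
    }
}
```

Because transferTo only advances chunk positions, a partial write (a slow socket accepting, say, half a chunk) is picked up exactly where it stopped on the next call, so no chunk ever needs to fit in a single ByteBuf.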
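Point (2) exists because a single MappedByteBuffer is capped at Integer.MAX_VALUE bytes (FileChannel.map rejects larger sizes), so a file segment over 2GB must be exposed as several mappings. The following pure-JDK sketch shows the chunked-mapping idea; mapInChunks is a hypothetical helper for illustration, not the method Spark adds, and the demo uses a tiny file with an artificially small chunk size.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

public class ChunkedMmap {
    // Map [offset, offset + length) of the file as a list of read-only
    // buffers, each at most maxChunk bytes. For real > 2GB segments,
    // maxChunk would be close to Integer.MAX_VALUE.
    public static List<ByteBuffer> mapInChunks(Path file, long offset,
                                               long length, int maxChunk)
            throws IOException {
        List<ByteBuffer> chunks = new ArrayList<>();
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long pos = offset;
            long remaining = length;
            while (remaining > 0) {
                long size = Math.min(remaining, maxChunk);
                chunks.add(ch.map(FileChannel.MapMode.READ_ONLY, pos, size));
                pos += size;
                remaining -= size;
            }
        }
        return chunks;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("chunked-mmap", ".bin");
        tmp.toFile().deleteOnExit();
        Files.write(tmp, new byte[]{10, 20, 30, 40, 50, 60, 70});

        // 7 bytes with a 3-byte chunk limit -> mappings of 3 + 3 + 1 bytes.
        List<ByteBuffer> chunks = mapInChunks(tmp, 0, 7, 3);
        System.out.println(chunks.size());            // 3
        System.out.println(chunks.get(2).remaining()); // 1
    }
}
```

Each mapping stays valid after the channel is closed, so the resulting list can be handed off as the chunks of a ChunkedByteBuffer-style container without keeping the file descriptor open.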