[GitHub] spark pull request: Store serialized blocks as multiple chunks in ...

JoshRosen Tue, 15 Mar 2016 17:52:51 -0700

GitHub user JoshRosen opened a pull request:

    https://github.com/apache/spark/pull/11748


    Store serialized blocks as multiple chunks in MemoryStore

    This patch modifies the BlockManager, MemoryStore, and several other 
storage components so that cached, serialized blocks are stored as multiple 
small chunks rather than as a single contiguous ByteBuffer.
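
    As a rough illustration of the idea (not the code in this patch), the
    sketch below writes bytes into fixed-size chunks instead of one growing
    array. The class name ChunkedOutputSketch, its methods, and the chunk size
    are hypothetical and chosen only for this example; the patch's actual
    ChunkedByteBuffer may be structured differently.

```java
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: an OutputStream that fills fixed-size chunks. */
final class ChunkedOutputSketch extends OutputStream {
    private final int chunkSize;
    private final List<byte[]> chunks = new ArrayList<>();
    private int posInLastChunk; // bytes used in the most recent chunk

    ChunkedOutputSketch(int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        this.chunkSize = chunkSize;
        this.posInLastChunk = chunkSize; // forces an allocation on first write
    }

    @Override
    public void write(int b) {
        // Allocate a new chunk only when the current one is full, so
        // existing data is never copied into a larger array.
        if (posInLastChunk == chunkSize) {
            chunks.add(new byte[chunkSize]);
            posInLastChunk = 0;
        }
        chunks.get(chunks.size() - 1)[posInLastChunk++] = (byte) b;
    }

    /** Total bytes written. A long, so it is not limited by one array's int index. */
    long size() {
        if (chunks.isEmpty()) {
            return 0L;
        }
        return (long) (chunks.size() - 1) * chunkSize + posInLastChunk;
    }

    /** Wraps each chunk in a ByteBuffer; the last chunk is copied to its exact length. */
    List<ByteBuffer> toByteBuffers() {
        List<ByteBuffer> out = new ArrayList<>();
        for (int i = 0; i < chunks.size(); i++) {
            boolean last = (i == chunks.size() - 1);
            byte[] chunk = last
                ? Arrays.copyOf(chunks.get(i), posInLastChunk) // trim the tail
                : chunks.get(i);
            out.add(ByteBuffer.wrap(chunk));
        }
        return out;
    }
}
```

    Only the final chunk can be partially filled, so trimming copies at most
    one chunk's worth of bytes rather than the whole block.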
    
    This change will help to improve the efficiency of memory allocation and 
the accuracy of memory accounting when serializing blocks. Our current 
serialization code uses a ByteBufferOutputStream, which doubles and 
re-allocates its backing byte array; this increases the peak memory 
requirements during serialization (since we need to hold extra memory while 
expanding the array). In addition, we currently don't account for the extra 
wasted space at the end of the ByteBuffer's backing array, so a 129 megabyte 
serialized block may actually consume 256 megabytes of memory. After switching 
to storing blocks in multiple chunks, we'll be able to efficiently trim the 
backing buffers so that no space is wasted.
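
    The arithmetic behind the 129 megabyte example can be modeled as below.
    This is a simplified model that assumes pure capacity doubling from a
    1 megabyte starting size and a 4 megabyte chunk size; both numbers, and
    the class and method names, are assumptions made for illustration and are
    not taken from Spark.

```java
/** Hypothetical model of allocation sizes; not Spark code. */
final class DoublingGrowthSketch {
    /** Capacity reached by doubling from initialCapacity until it can hold needed bytes. */
    static long capacityAfterDoubling(long initialCapacity, long needed) {
        if (initialCapacity <= 0) {
            throw new IllegalArgumentException("initialCapacity must be positive");
        }
        long capacity = initialCapacity;
        while (capacity < needed) {
            capacity *= 2;
        }
        return capacity;
    }

    /** Bytes allocated by whole fixed-size chunks before the last chunk is trimmed. */
    static long untrimmedChunkedCapacity(long chunkSize, long needed) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        long chunkCount = (needed + chunkSize - 1) / chunkSize; // ceiling division
        return chunkCount * chunkSize;
    }
}
```

    Under these assumptions a 129 megabyte block ends up in a 256 megabyte
    array when doubling, while 4 megabyte chunks allocate 132 megabytes before
    the last chunk is trimmed and exactly 129 megabytes afterwards.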
    
    This change is also a prerequisite to being able to cache blocks which are 
larger than 2GB (although full support for that depends on several other 
changes which have not been implemented yet).

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/JoshRosen/spark chunked-block-serialization

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/11748.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #11748
    
----
commit 735eca68d8efcd150d47631644cf848b4d98603e
Author: Josh Rosen <joshro...@databricks.com>
Date:   2016-03-15T04:57:16Z

    Split MemoryEntry into two separate classes (serialized and deserialized)

commit 8f0828986b72ce722cfe0360ae863971547fc58b
Author: Josh Rosen <joshro...@databricks.com>
Date:   2016-03-15T18:53:54Z

    Add ChunkedByteBuffer and use it in storage layer.

commit 79b1a6a31236b81c444dda1e8ee1cfdf2f3c36ae
Author: Josh Rosen <joshro...@databricks.com>
Date:   2016-03-15T20:53:27Z

    Add test cases and fix bug in ChunkedByteBuffer.toInputStream()

commit 7dbcd5a9ef0c669f5db97990af944d8b63300e97
Author: Josh Rosen <joshro...@databricks.com>
Date:   2016-03-15T22:05:23Z

    WIP towards understanding destruction.

commit 3fbec212d9f714386121b4aed791d6c9fb1359a2
Author: Josh Rosen <joshro...@databricks.com>
Date:   2016-03-15T22:39:27Z

    Small fixes to dispose behavior.

commit e5e663f22094333dac6e184c78176ee658e3441e
Author: Josh Rosen <joshro...@databricks.com>
Date:   2016-03-15T22:49:24Z

    Modify BlockManager.dataSerialize to write ChunkedByteBuffers.

----


