GitHub user JoshRosen opened a pull request: https://github.com/apache/spark/pull/11748

    Store serialized blocks as multiple chunks in MemoryStore

    This patch modifies the BlockManager, MemoryStore, and several other storage components so that cached, serialized blocks are stored as multiple small chunks rather than as a single contiguous ByteBuffer. This change will help to improve the efficiency of memory allocation and the accuracy of memory accounting when serializing blocks.

    Our current serialization code uses a ByteBufferOutputStream, which doubles and re-allocates its backing byte array; this increases the peak memory requirements during serialization (since we need to hold extra memory while expanding the array). In addition, we currently don't account for the extra wasted space at the end of the ByteBuffer's backing array, so a 129 megabyte serialized block may actually consume 256 megabytes of memory. After switching to storing blocks in multiple chunks, we'll be able to efficiently trim the backing buffers so that no space is wasted.

    This change is also a prerequisite to being able to cache blocks which are larger than 2GB (although full support for that depends on several other changes which have not been implemented yet).

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/JoshRosen/spark chunked-block-serialization

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/11748.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #11748

----
commit 735eca68d8efcd150d47631644cf848b4d98603e
Author: Josh Rosen <joshro...@databricks.com>
Date:   2016-03-15T04:57:16Z

    Split MemoryEntry into two separate classes (serialized and deserialized)

commit 8f0828986b72ce722cfe0360ae863971547fc58b
Author: Josh Rosen <joshro...@databricks.com>
Date:   2016-03-15T18:53:54Z

    Add ChunkedByteBuffer and use it in storage layer.

commit 79b1a6a31236b81c444dda1e8ee1cfdf2f3c36ae
Author: Josh Rosen <joshro...@databricks.com>
Date:   2016-03-15T20:53:27Z

    Add test cases and fix bug in ChunkedByteBuffer.toInputStream()

commit 7dbcd5a9ef0c669f5db97990af944d8b63300e97
Author: Josh Rosen <joshro...@databricks.com>
Date:   2016-03-15T22:05:23Z

    WIP towards understanding destruction.

commit 3fbec212d9f714386121b4aed791d6c9fb1359a2
Author: Josh Rosen <joshro...@databricks.com>
Date:   2016-03-15T22:39:27Z

    Small fixes to dispose behavior.

commit e5e663f22094333dac6e184c78176ee658e3441e
Author: Josh Rosen <joshro...@databricks.com>
Date:   2016-03-15T22:49:24Z

    Modify BlockManager.dataSerialize to write ChunkedByteBuffers.

----
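
For readers who want to check the memory arithmetic in the pull request description, here is a rough Python sketch. It is not Spark code and does not reflect the actual ByteBufferOutputStream or ChunkedByteBuffer implementations. It assumes a buffer that starts at a hypothetical 1 MB capacity and doubles whenever it fills, and compares that with storing the same data in hypothetical fixed-size 4 MB chunks whose last chunk is trimmed to fit. All function names and both sizes are illustrative assumptions.

```python
MB = 1024 * 1024


def doubling_capacity(payload_bytes, initial_capacity=MB):
    """Final capacity of a buffer that doubles until the payload fits.

    initial_capacity is an assumed value chosen for illustration.
    """
    capacity = initial_capacity
    while capacity < payload_bytes:
        capacity *= 2
    return capacity


def doubling_peak(payload_bytes, initial_capacity=MB):
    """Peak bytes held while growing by doubling.

    During each expansion the old array and the new, twice-as-large
    array exist at the same time while data is copied across.
    """
    capacity = initial_capacity
    peak = capacity
    while capacity < payload_bytes:
        peak = max(peak, capacity + capacity * 2)
        capacity *= 2
    return peak


def chunk_sizes(payload_bytes, chunk_size=4 * MB):
    """Sizes of the chunks needed for the payload, last chunk trimmed.

    chunk_size is an assumed value chosen for illustration.
    """
    full_chunks, remainder = divmod(payload_bytes, chunk_size)
    sizes = [chunk_size] * full_chunks
    if remainder:
        sizes.append(remainder)
    return sizes


block = 129 * MB
print("doubling buffer capacity (MB):", doubling_capacity(block) // MB)
print("doubling buffer peak (MB):", doubling_peak(block) // MB)
print("chunked total (MB):", sum(chunk_sizes(block)) // MB)
print("number of chunks:", len(chunk_sizes(block)))
```

Under these assumptions, the 129 MB block from the description ends up in a 256 MB backing array when the buffer doubles, while the trimmed chunks together hold exactly 129 MB.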