This is one problem I'd like to address soon - providing a binary block management interface for shuffle (and maybe other things) that avoids serialization/copying.
On Fri, Feb 27, 2015 at 3:39 PM, Paul Wais <paulw...@gmail.com> wrote: > Dear List, > > I'm investigating some problems related to native code integration > with Spark, and while picking through BlockManager I noticed that data > (de)serialization currently issues lots of array copies. > Specifically: > > - Deserialization: BlockManager marshals all deserialized bytes > through a spark.util. ByteBufferInputStream, which necessitates > copying data into an intermediate temporary byte[] . The temporary > byte[] might be reused between deserialization of T instances, but > nevertheless the bytes must be copied (and likely in a Java loop). > > - Serialization: BlockManager buffers all serialized bytes into a > java.io.ByteArrayOutputStream, which maintains an internal byte[] > buffer and grows/re-copies the buffer like a vector as the buffer > fills. BlockManager then retrieves the internal byte[] buffer, wraps > it in a ByteBuffer, and sends it off to be stored (e.g. in > MemoryStore, DiskStore, Tachyon, etc). > > When an individual T is somewhat large (e.g. a feature vector, an > image, etc), or blocks are megabytes in size, these copies become > expensive (especially for large written blocks), right? Does anybody > have any measurements of /how/ expensive they are? If not, is there > serialization benchmark code (e.g. for KryoSerializer ) that might be > helpful here? > > > As part of my investigation, I've found that one might be able to > sidestep these issues by extending Spark's SerializerInstance API to > offer I/O on ByteBuffers (in addition to {Input,Output}Streams). An > extension including a ByteBuffer API would furthermore have many > benefits for native code. A major downside of this API addition is > that it wouldn't interoperate (nontrivially) with compression, so > shuffles wouldn't benefit. Nevertheless, BlockManager could probably > deduce when use of this ByteBuffer API is possible and leverage it. > > Cheers, > -Paul > > --------------------------------------------------------------------- > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > >