This is one problem I'd like to address soon - providing a binary block
management interface for shuffle (and maybe other things) that avoids
serialization/copying.


On Fri, Feb 27, 2015 at 3:39 PM, Paul Wais <paulw...@gmail.com> wrote:

> Dear List,
>
> I'm investigating some problems related to native code integration
> with Spark, and while picking through BlockManager I noticed that data
> (de)serialization currently issues lots of array copies.
> Specifically:
>
> - Deserialization: BlockManager marshals all deserialized bytes
> through a spark.util.ByteBufferInputStream, which necessitates
> copying data into an intermediate temporary byte[].  The temporary
> byte[] might be reused across deserializations of T instances, but
> nevertheless the bytes must be copied (and likely in a Java loop).
>
> - Serialization: BlockManager buffers all serialized bytes into a
> java.io.ByteArrayOutputStream, which maintains an internal byte[]
> buffer and grows/re-copies that buffer, vector-style, as it fills.
> BlockManager then retrieves the internal byte[] buffer, wraps it in a
> ByteBuffer, and sends it off to be stored (e.g. in MemoryStore,
> DiskStore, Tachyon, etc.).  Both copy paths are sketched below.
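>
> Roughly, the two copy patterns look something like this (a simplified
> sketch in the spirit of the current code paths, not the actual Spark
> source; names are mine):
>
>   import java.io.{ByteArrayOutputStream, InputStream}
>   import java.nio.ByteBuffer
>
>   // Read path: every read() pulls bytes out of the ByteBuffer into a
>   // caller-supplied temporary byte[].
>   class SketchByteBufferInputStream(buffer: ByteBuffer) extends InputStream {
>     override def read(): Int =
>       if (!buffer.hasRemaining) -1 else buffer.get() & 0xFF
>
>     override def read(dest: Array[Byte], offset: Int, length: Int): Int = {
>       if (!buffer.hasRemaining) return -1
>       val amount = math.min(length, buffer.remaining())
>       buffer.get(dest, offset, amount)  // copy: ByteBuffer -> byte[]
>       amount
>     }
>   }
>
>   // Write path: serialized bytes accumulate in a growable byte[] that is
>   // re-allocated and copied forward as it fills; the backing array is
>   // then exposed and wrapped for storage.
>   class ExposedByteArrayOutputStream(initialSize: Int)
>       extends ByteArrayOutputStream(initialSize) {
>     def internalBuffer: Array[Byte] = buf   // protected backing array
>     def internalCount: Int = count
>   }
>
>   def sketchSerializeBlock(writeAll: java.io.OutputStream => Unit): ByteBuffer = {
>     val bos = new ExposedByteArrayOutputStream(4096)
>     writeAll(bos)                           // grow-and-copy happens inside write()
>     ByteBuffer.wrap(bos.internalBuffer, 0, bos.internalCount)
>   }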
>
> When an individual T is somewhat large (e.g. a feature vector, an
> image, etc), or blocks are megabytes in size, these copies become
> expensive (especially for large written blocks), right?  Does anybody
> have any measurements of /how/ expensive they are?  If not, is there
> serialization benchmark code (e.g. for KryoSerializer) that might be
> helpful here?
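>
> For reference, a rough (non-JMH, so not rigorous) way to time the
> round trip through KryoSerializer's ByteBuffer-based serialize and
> deserialize calls, which internally go through the stream and byte[]
> path above, might be:
>
>   import org.apache.spark.SparkConf
>   import org.apache.spark.serializer.KryoSerializer
>
>   object SerDeCopyCost {
>     def main(args: Array[String]): Unit = {
>       val ser = new KryoSerializer(new SparkConf()).newInstance()
>       // ~8 MB payload, standing in for a large feature vector
>       val payload = Array.fill(1 << 20)(scala.util.Random.nextDouble())
>
>       // Warm up the JIT before timing.
>       (1 to 20).foreach { _ => ser.deserialize[Array[Double]](ser.serialize(payload)) }
>
>       val iters = 100
>       val t0 = System.nanoTime()
>       (1 to iters).foreach { _ => ser.deserialize[Array[Double]](ser.serialize(payload)) }
>       val elapsedMs = (System.nanoTime() - t0) / 1e6
>       println(f"avg round trip: ${elapsedMs / iters}%.2f ms")
>     }
>   }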
>
>
> As part of my investigation, I've found that one might be able to
> sidestep these copies by extending Spark's SerializerInstance API to
> offer I/O directly on ByteBuffers (in addition to the current
> {Input,Output}Streams).  Such a ByteBuffer API would also have clear
> benefits for native code integration.  A major downside of this
> addition is that it wouldn't interoperate with compression (at least
> not trivially), so shuffles wouldn't benefit.  Nevertheless,
> BlockManager could probably deduce when use of this ByteBuffer API is
> possible and leverage it.
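>
> To make the idea concrete, the additions would look roughly like this
> (method names and shapes are only a sketch of mine, not an existing
> Spark API):
>
>   import java.nio.ByteBuffer
>   import scala.reflect.ClassTag
>   import org.apache.spark.serializer.SerializerInstance
>
>   // Hypothetical: a serializer that can write into / read from a
>   // caller-supplied ByteBuffer (e.g. a direct or memory-mapped buffer)
>   // without the intermediate byte[] hop.
>   abstract class ByteBufferSerializerInstance extends SerializerInstance {
>     /** Serialize t directly into dest; returns the number of bytes written. */
>     def serializeTo[T: ClassTag](t: T, dest: ByteBuffer): Int
>
>     /** Deserialize a T directly out of src, advancing its position. */
>     def deserializeFrom[T: ClassTag](src: ByteBuffer): T
>   }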
>
> Cheers,
> -Paul
>