Dear List,

I'm investigating some problems related to native code integration
with Spark, and while picking through BlockManager I noticed that data
(de)serialization currently performs a lot of array copies.
Specifically:

- Deserialization: BlockManager marshals all deserialized bytes
through a spark.util.ByteBufferInputStream, which necessitates
copying data into an intermediate temporary byte[].  The temporary
byte[] may be reused across deserializations of T instances, but the
bytes must still be copied (and likely in a Java loop).

- Serialization: BlockManager buffers all serialized bytes into a
java.io.ByteArrayOutputStream, which maintains an internal byte[]
buffer and grows/re-copies that buffer, vector-style, as it fills.
BlockManager then retrieves the internal byte[], wraps it in a
ByteBuffer, and sends it off to be stored (e.g. in MemoryStore,
DiskStore, Tachyon, etc.).  Both copy paths are sketched below.
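
To make the copies concrete, here is a rough paraphrase of the two
paths as I understand them, not the actual BlockManager code:
BufferBackedInputStream and serializeBlock are names I made up, while
ByteBuffer.get and ByteArrayOutputStream.toByteArray are the real
copy points.

    import java.io.{ByteArrayOutputStream, InputStream, OutputStream}
    import java.nio.ByteBuffer

    // Read path: a stream over a ByteBuffer has to copy bytes out of
    // the buffer into a caller-supplied byte[] before the serializer
    // can see them (roughly what spark.util.ByteBufferInputStream does).
    class BufferBackedInputStream(buf: ByteBuffer) extends InputStream {
      override def read(): Int =
        if (!buf.hasRemaining) -1 else buf.get() & 0xFF
      override def read(dest: Array[Byte], off: Int, len: Int): Int = {
        if (!buf.hasRemaining) {
          -1
        } else {
          val n = math.min(len, buf.remaining())
          buf.get(dest, off, n)  // copy: ByteBuffer -> temporary byte[]
          n
        }
      }
    }

    // Write path: ByteArrayOutputStream grows (and re-copies) its
    // internal byte[] as output accumulates, and toByteArray copies
    // it once more before it is wrapped in a ByteBuffer.
    def serializeBlock(write: OutputStream => Unit): ByteBuffer = {
      val bos = new ByteArrayOutputStream(4096)
      write(bos)                        // growth re-copies as it fills
      ByteBuffer.wrap(bos.toByteArray)  // final copy of the whole block
    }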

When an individual T is somewhat large (e.g. a feature vector, an
image, etc.), or when blocks are megabytes in size, these copies
become expensive (especially for large written blocks), right?  Does
anybody have measurements of /how/ expensive they are?  If not, is
there serialization benchmark code (e.g. for KryoSerializer) that
might be helpful here?
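
In case it is useful, the kind of crude micro-benchmark I had in mind
looks roughly like this (KryoSerializer and SerializerInstance are
Spark's classes; the Record payload and the timing harness are just
placeholders I made up):

    import org.apache.spark.SparkConf
    import org.apache.spark.serializer.KryoSerializer

    object SerCopyBench {
      // Hypothetical "largish" record, e.g. a dense feature vector.
      case class Record(features: Array[Double])

      def main(args: Array[String]): Unit = {
        val ser = new KryoSerializer(new SparkConf()).newInstance()
        val rec = Record(Array.fill(1 << 16)(1.0))  // ~512 KB of doubles

        val iters = 1000
        var i = 0
        val start = System.nanoTime()
        while (i < iters) {
          val buf = ser.serialize(rec)  // ByteBuffer holding the bytes
          ser.deserialize[Record](buf)  // back through the stream path
          i += 1
        }
        val elapsedMs = (System.nanoTime() - start) / 1e6
        println(s"$iters round-trips in $elapsedMs ms")
      }
    }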


As part of my investigation, I've found that one might be able to
sidestep these issues by extending Spark's SerializerInstance API to
offer I/O directly on ByteBuffers (in addition to
{Input,Output}Streams).  Such an extension would also be a real win
for native code.  A major downside of this addition is that it
wouldn't interoperate (at least not trivially) with compression, so
shuffles wouldn't benefit.  Nevertheless, BlockManager could probably
detect when the ByteBuffer API is usable and take advantage of it.
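
Concretely, the extension I have in mind is something along these
lines (a strawman only; serializeTo/deserializeFrom do not exist in
Spark today, and the exact signatures are up for debate):

    import java.nio.ByteBuffer
    import scala.reflect.ClassTag

    // Strawman additions to SerializerInstance: direct ByteBuffer I/O,
    // so blocks can be (de)serialized without an intermediate byte[].
    // Defaults could fall back to the existing stream-based API.
    trait ByteBufferSerializerInstance {
      // Serialize t directly into dest (e.g. a direct buffer backing
      // MemoryStore), returning the number of bytes written.
      def serializeTo[T: ClassTag](t: T, dest: ByteBuffer): Int

      // Deserialize a T by reading directly from src, without first
      // copying src's contents into a temporary byte[].
      def deserializeFrom[T: ClassTag](src: ByteBuffer): T
    }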

Cheers,
-Paul
