Dear List,

I'm investigating some problems related to native code integration with Spark, and while picking through BlockManager I noticed that data (de)serialization currently incurs a lot of array copies. Specifically:
- Deserialization: BlockManager marshals all deserialized bytes through a spark.util.ByteBufferInputStream, which necessitates copying data into an intermediate temporary byte[]. The temporary byte[] might be reused between deserializations of T instances, but the bytes must nevertheless be copied (and likely in a Java loop).

- Serialization: BlockManager buffers all serialized bytes into a java.io.ByteArrayOutputStream, which maintains an internal byte[] buffer and grows/re-copies that buffer like a vector as it fills. BlockManager then retrieves the internal byte[] buffer, wraps it in a ByteBuffer, and sends it off to be stored (e.g. in MemoryStore, DiskStore, Tachyon, etc.).

When an individual T is somewhat large (e.g. a feature vector, an image, etc.), or when blocks are megabytes in size, these copies become expensive (especially for large written blocks), right? Does anybody have any measurements of /how/ expensive they are? If not, is there serialization benchmark code (e.g. for KryoSerializer) that might be helpful here?

As part of my investigation, I've found that one might be able to sidestep these issues by extending Spark's SerializerInstance API to offer I/O on ByteBuffers (in addition to {Input,Output}Streams); rough sketches of what I mean are below my signature. An extension with a ByteBuffer API would also have many benefits for native code. A major downside of this addition is that it wouldn't interoperate with compression (at least not trivially), so shuffles wouldn't benefit. Nevertheless, BlockManager could probably deduce when use of this ByteBuffer API is possible and leverage it.

Cheers,
-Paul
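P.S. In case a concrete picture helps, here is a simplified, stand-alone sketch of the serialization-side pattern described above. It is NOT the actual BlockManager code; plain Java serialization stands in for Spark's SerializationStream, and the point is only where the byte[] copies happen:

import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import java.nio.ByteBuffer

object CopyHeavySerializeSketch {
  def serializeBlock(values: Iterator[AnyRef]): ByteBuffer = {
    // ByteArrayOutputStream grows its internal byte[] by allocating a larger
    // array and copying the old contents over, vector-style, as it fills.
    val byteStream = new ByteArrayOutputStream(4096)
    val objStream = new ObjectOutputStream(byteStream)
    values.foreach(v => objStream.writeObject(v))
    objStream.close()
    // The finished bytes are handed off wrapped in a ByteBuffer. (toByteArray
    // makes one more full copy here; wrapping the internal buffer directly
    // avoids that copy, but not the growth copies above.)
    ByteBuffer.wrap(byteStream.toByteArray)
  }
}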
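P.P.S. And here is a rough, untested sketch of the kind of SerializerInstance extension I have in mind. Every name below is a placeholder I made up for illustration; nothing like this exists in Spark today:

import java.nio.ByteBuffer

/** Serializes a stream of objects directly into a caller-supplied ByteBuffer
  * (heap, direct, or memory-mapped), so no intermediate byte[] is needed. */
trait ByteBufferSerializationStream {
  def writeObject[T](t: T): this.type
  def flush(): Unit
  def close(): Unit
}

/** Deserializes objects directly out of a ByteBuffer, without first copying
  * slices of it into a temporary byte[]. */
trait ByteBufferDeserializationStream {
  def readObject[T](): T
  def close(): Unit
}

/** Serializer instances capable of ByteBuffer I/O would expose this in
  * addition to the existing stream-based serializeStream/deserializeStream.
  * BlockManager could detect the capability (e.g. with a pattern match) and
  * fall back to the stream path, including the compressed shuffle path,
  * whenever it isn't available. */
trait ByteBufferSerializerInstance {
  def serializeToBuffer(buffer: ByteBuffer): ByteBufferSerializationStream
  def deserializeFromBuffer(buffer: ByteBuffer): ByteBufferDeserializationStream
}

With something along these lines, BlockManager's serialization helper could write straight into a ByteBuffer sized for the block (or a memory-mapped region) instead of going through ByteArrayOutputStream, and the deserialization helper could read objects out of a stored ByteBuffer without the temporary byte[].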