What Sean said.
You should also definitely turn on Kryo serialization. The default
Java serialization is really slow if you're going to move around
lots of data. Also make sure you use a cluster with high network
bandwidth.
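
For reference, a minimal sketch of turning Kryo on, assuming a Scala job built against Spark 1.2 or later (registerKryoClasses is not available before that; on older versions you would set spark.kryo.registrator instead). The app name and the KryoSetup object are made up for illustration; the Image case class is the one from the original question.

```scala
import org.apache.spark.SparkConf

// Same shape as the case class in the original question.
case class Image(groupId: String, imageId: String, buffer: String)

object KryoSetup {
  // Build a SparkConf that swaps the default Java serializer for Kryo.
  def buildConf(): SparkConf =
    new SparkConf()
      .setAppName("image-processing") // hypothetical app name
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Registering classes lets Kryo write a small ID instead of the full class name.
      .registerKryoClasses(Array(classOf[Image]))
}
```

Registering the classes is optional, but unregistered classes get their full class name written with every object, which eats into the savings.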
On Thu, Sep 18, 2014 at 3:06 AM, Sean Owen so...@cloudera.com wrote:
Hi all,
I'm trying to process a large image data set and need some way to optimize
my implementation since it's very slow right now. In my current
implementation I store my images in an object file with the following fields:
case class Image(groupId: String, imageId: String, buffer: String)
Base64 is an inefficient encoding for binary data, by about 2.6x. You could
use byte[] directly.
But you would still be storing and potentially shuffling lots of data in
your RDDs.
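
A rough way to see where a figure like that can come from, as a sketch rather than a measurement: Base64 produces 4 characters for every 3 input bytes, and on JVMs that back String with UTF-16 chars (Java 8 and earlier) each character takes 2 bytes. The Base64Overhead object and the 3000-byte stand-in buffer are made up for illustration, and java.util.Base64 needs Java 8 or later.

```scala
import java.util.Base64

object Base64Overhead {
  // Number of characters in the Base64 encoding of `raw`.
  def encodedLength(raw: Array[Byte]): Int =
    Base64.getEncoder.encodeToString(raw).length

  def main(args: Array[String]): Unit = {
    // Stand-in for real image bytes; only the length matters here.
    val raw = Array.fill[Byte](3000)(0x7f.toByte)
    val chars = encodedLength(raw)

    println(s"raw bytes:            ${raw.length}")
    println(s"base64 characters:    $chars")
    // Assumes 2 bytes per char (UTF-16 backed String).
    println(s"approx. String bytes: ${chars * 2}")
  }
}
```

With byte[] (Array[Byte] in Scala) as the buffer field, the payload stays at its raw size.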
If the files exist separately on HDFS perhaps you can just send around the
file location and load it directly using