Hi all,

I'm trying to process a large image data set and need a way to optimize
my implementation, since it's very slow right now. In my current
implementation I store my images in an object file with the following fields:

case class Image(groupId: String, imageId: String, buffer: String)

Images belong to groups and have an id; the buffer is the image file (jpg,
png) encoded as a base64 string.
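
For reference, one record is built roughly like this (just a sketch; the
loadImage helper name is illustrative, the real loading code may differ):

import java.nio.file.{Files, Paths}
import java.util.Base64

// Illustrative construction of one record: read the raw image file
// and store its bytes as a base64 string in the buffer field.
def loadImage(groupId: String, imageId: String, path: String): Image = {
  val bytes = Files.readAllBytes(Paths.get(path))
  Image(groupId, imageId, Base64.getEncoder.encodeToString(bytes))
}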

Before running an image processing algorithm on the image buffer, I have a
lot of jobs that filter, group, and join the images in my data set based on
groupId or imageId, and these steps are relatively slow. I suspect that
Spark moves my image buffers around even when they are not needed for these
specific jobs, so a lot of time is wasted on communication.
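
Roughly, one such step looks like this (a sketch only; the function names
are just examples of the kind of job I mean):

import org.apache.spark.SparkContext._ // pair-RDD operations on older Spark versions
import org.apache.spark.rdd.RDD

// One of the slow steps: grouping full Image records by groupId.
// The whole record, including the large base64 buffer, goes through
// the shuffle even though only the ids matter for the grouping itself.
def groupImages(images: RDD[Image]): RDD[(String, Iterable[Image])] =
  images.groupBy(_.groupId)

// A join on imageId has the same shape: both sides carry their buffers.
def joinOnImageId(a: RDD[Image], b: RDD[Image]): RDD[(String, (Image, Image))] =
  a.keyBy(_.imageId).join(b.keyBy(_.imageId))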

Is there a better way to optimize my implementation?

Regards,

Jaonary
