Hi all,

I'm new to Spark and wondering if it's appropriate to use for some image
processing tasks on pretty sizable (~1 GB) images.

Here is an example use case.  Amazon recently put the entire Landsat 8
archive in S3:

http://aws.amazon.com/public-data-sets/landsat/

I have a bunch of Python scripts based on GDAL (a C library for geospatial
raster I/O) that take a collection of Landsat images and mash them into a
single mosaic.  This works great for little mosaics, but if I wanted to do
the entire world, I'd need more horsepower!  The scripts do the following:

   1. Copy the selected rasters down from S3 to the local file system
   2. Read each image into memory as numpy arrays (a big 3D array), do some
   image processing using various Python libs, and write the result out to the
   local file system
   3. Blast all the processed imagery back to S3, and hook up MapServer
   for viewing

Step 2 takes a long time; this is what I'd like to leverage Spark for.
Each image, if you stack all the bands together, can be ~1 GB in size.
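For context, here's roughly what step 2 looks like in memory.  This is a
hypothetical sketch: the arrays below are synthetic stand-ins for what GDAL's
ReadAsArray() would return, and NDVI (Landsat 8 bands 4 and 5) is just one
example of a per-pixel computation:

```python
import numpy as np

# Synthetic stand-ins for two Landsat 8 bands; in the real scripts these
# would come from GDAL's band.ReadAsArray().
red = np.random.rand(512, 512).astype(np.float32)   # band 4 (red)
nir = np.random.rand(512, 512).astype(np.float32)   # band 5 (near-IR)

# Stack all the bands into one big 3D array: (bands, rows, cols)
cube = np.stack([red, nir])

# Example per-pixel computation: NDVI = (NIR - red) / (NIR + red)
# (small epsilon in the denominator to avoid division by zero)
ndvi = (cube[1] - cube[0]) / (cube[1] + cube[0] + 1e-9)
```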

So here are a couple of questions:


   1. If I have a large image/array, what's a good way of getting it into
   an RDD?  I've seen some stuff about folks tiling up imagery into little
   chunks and storing it in HBase.  I imagine I would want an image chunk in
   each partition of the RDD.  If I wanted to apply something like a Gaussian
   filter, I'd need each chunk to overlap its neighbors a bit.
   2. In a similar vein, does anyone have any thoughts on storing a really
   large raster in HDFS?  It seems like if I just dump the image into HDFS as
   is, it'll get stored in blocks all across the cluster, and when I go to
   read it, there will be a ton of network traffic from all the blocks to the
   reading node!
   3. How good is numpy ndarray support in Spark?  For instance, if I do a
   map on my theoretical chunked image RDD, can I easily realize the image
   chunk as a numpy array inside the function?  Most of the Python algorithms
   I use take in and return a numpy array.
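To make question 1 concrete, here's a rough numpy/scipy sketch (plain Python,
not Spark code; the function names are my own invention) of tiling a big 2D
array with a halo of overlap, filtering each padded tile, and cropping the
halo back off.  As long as the halo covers the kernel radius, the per-tile
result matches filtering the whole image at once:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def tile_with_halo(image, tile_size, halo):
    """Split a 2D array into tiles, each padded with a halo of
    neighboring pixels so a filter can be applied per tile."""
    h, w = image.shape
    tiles = []
    for r0 in range(0, h, tile_size):
        for c0 in range(0, w, tile_size):
            r1, c1 = min(r0 + tile_size, h), min(c0 + tile_size, w)
            # Grow the window by the halo, clamped to the image bounds.
            rr0, cc0 = max(r0 - halo, 0), max(c0 - halo, 0)
            rr1, cc1 = min(r1 + halo, h), min(c1 + halo, w)
            # Remember where the core tile sits inside the padded window.
            tiles.append(((r0, c0, r1, c1), (r0 - rr0, c0 - cc0),
                          image[rr0:rr1, cc0:cc1]))
    return tiles

def filter_and_crop(tiles, func):
    """Apply func to each padded tile, then crop the halo back off."""
    pieces = {}
    for (r0, c0, r1, c1), (top, left), tile in tiles:
        out = func(tile)
        pieces[(r0, c0)] = out[top:top + (r1 - r0), left:left + (c1 - c0)]
    return pieces

# Demo: per-tile Gaussian filtering matches the global result when the
# halo >= the kernel radius (int(truncate * sigma + 0.5), i.e. 4 for
# sigma=1 with scipy's default truncate=4.0).
img = np.random.rand(100, 120)
sigma, halo = 1.0, 4
tiles = tile_with_halo(img, 32, halo)
pieces = filter_and_crop(tiles, lambda t: gaussian_filter(t, sigma))

# Reassemble the cropped pieces into a mosaic.
mosaic = np.zeros_like(img)
for (r0, c0), piece in pieces.items():
    mosaic[r0:r0 + piece.shape[0], c0:c0 + piece.shape[1]] = piece
```

In a Spark version, each ((r0, c0), padded_tile) pair could presumably become
one element of a pair RDD, and the same per-tile function passed to a map;
because the halo is cropped off afterward, the pieces can be written back
independently without stitching artifacts.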

I saw some past discussion of image processing.  These threads talk about
processing lots of little images, which isn't really my situation, as I've
got one very large image:

http://apache-spark-user-list.1001560.n3.nabble.com/Better-way-to-process-large-image-data-set-td14533.html
http://apache-spark-user-list.1001560.n3.nabble.com/Processing-audio-video-images-td6752.html

Further, I'd like to have the imagery in HDFS rather than on the local file
system to avoid I/O bottlenecks if possible!

Thanks for any ideas and advice!
-Patrick
