Hi all,

I'm new to Spark and wondering if it's appropriate to use for some image
processing tasks on pretty sizable (~1 GB) images.

Here is an example use case.  Amazon recently put the entire Landsat 8
archive in S3:

http://aws.amazon.com/public-data-sets/landsat/

I have a bunch of Python scripts based on GDAL (a C library for geospatial
raster I/O) that take a collection of Landsat images and mash them into a
single mosaic.  This works great for little mosaics, but if I wanted to do
the entire world, I'd need more horsepower!  The scripts do the following:

   1. Copy the selected rasters down from S3 to the local file system
   2. Read each image into memory as numpy arrays (a big 3D array), do some
   image processing using various Python libs, and write the result out to the
   local file system
   3. Blast all the processed imagery back to S3, and hook up MapServer
   for viewing

Step 2 takes a long time; this is what I'd like to leverage Spark for.
Each image, if you stack all the bands together, can be ~1 GB in size.
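For context, here's roughly what step 2 looks like in memory.  This is a
hypothetical sketch: the arrays below are synthetic stand-ins for what GDAL's
ReadAsArray() would return, and NDVI (Landsat 8 bands 4 and 5) is just one
example of a per-pixel computation:

```python
import numpy as np

# Synthetic stand-ins for two Landsat 8 bands; in the real scripts these
# would come from GDAL's band.ReadAsArray().
red = np.random.rand(512, 512).astype(np.float32)   # band 4 (red)
nir = np.random.rand(512, 512).astype(np.float32)   # band 5 (near-IR)

# Stack all the bands into one big 3D array: (bands, rows, cols)
cube = np.stack([red, nir])

# Example per-pixel computation: NDVI = (NIR - red) / (NIR + red)
# (small epsilon in the denominator to avoid division by zero)
ndvi = (cube[1] - cube[0]) / (cube[1] + cube[0] + 1e-9)
```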

So here are a couple of questions:


   1. If I have a large image/array, what's a good way of getting it into
   an RDD?  I've seen some stuff about folks tiling up imagery into little
   chunks and storing it in HBase.  I imagine I would want an image chunk in
   each partition of the RDD.  If I wanted to apply something like a Gaussian
   filter, I'd need each chunk to overlap its neighbors a bit.
   2. In a similar vein, does anyone have any thoughts on storing a really
   large raster in HDFS?  It seems like if I just dump the image into HDFS as
   is, it'll get stored in blocks all across the cluster, and when I go to
   read it, there will be a ton of network traffic from all the blocks to the
   reading node!
   3. How good is numpy ndarray support in Spark?  For instance, if I do a
   map on my theoretical chunked image RDD, can I easily realize the image
   chunk as a numpy array inside the function?  Most of the Python algorithms
   I use take in and return a numpy array.
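To make question 1 concrete, here's a rough numpy/scipy sketch (plain Python,
not Spark code; the function names are my own invention) of tiling a big 2D
array with a halo of overlap, filtering each padded tile, and cropping the
halo back off.  As long as the halo covers the kernel radius, the per-tile
result matches filtering the whole image at once:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def tile_with_halo(image, tile_size, halo):
    """Split a 2D array into tiles, each padded with a halo of
    neighboring pixels so a filter can be applied per tile."""
    h, w = image.shape
    tiles = []
    for r0 in range(0, h, tile_size):
        for c0 in range(0, w, tile_size):
            r1, c1 = min(r0 + tile_size, h), min(c0 + tile_size, w)
            # Grow the window by the halo, clamped to the image bounds.
            rr0, cc0 = max(r0 - halo, 0), max(c0 - halo, 0)
            rr1, cc1 = min(r1 + halo, h), min(c1 + halo, w)
            # Remember where the core tile sits inside the padded window.
            tiles.append(((r0, c0, r1, c1), (r0 - rr0, c0 - cc0),
                          image[rr0:rr1, cc0:cc1]))
    return tiles

def filter_and_crop(tiles, func):
    """Apply func to each padded tile, then crop the halo back off."""
    pieces = {}
    for (r0, c0, r1, c1), (top, left), tile in tiles:
        out = func(tile)
        pieces[(r0, c0)] = out[top:top + (r1 - r0), left:left + (c1 - c0)]
    return pieces

# Demo: per-tile Gaussian filtering matches the global result when the
# halo >= the kernel radius (int(truncate * sigma + 0.5), i.e. 4 for
# sigma=1 with scipy's default truncate=4.0).
img = np.random.rand(100, 120)
sigma, halo = 1.0, 4
tiles = tile_with_halo(img, 32, halo)
pieces = filter_and_crop(tiles, lambda t: gaussian_filter(t, sigma))

# Reassemble the cropped pieces into a mosaic.
mosaic = np.zeros_like(img)
for (r0, c0), piece in pieces.items():
    mosaic[r0:r0 + piece.shape[0], c0:c0 + piece.shape[1]] = piece
```

In a Spark version, each ((r0, c0), padded_tile) pair could presumably become
one element of a pair RDD, and the same per-tile function passed to a map;
because the halo is cropped off afterward, the pieces can be written back
independently without stitching artifacts.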

I saw some past discussion of image processing.  These threads talk about
processing lots of little images, which isn't really my situation, as I've
got one very large image:

http://apache-spark-user-list.1001560.n3.nabble.com/Better-way-to-process-large-image-data-set-td14533.html
http://apache-spark-user-list.1001560.n3.nabble.com/Processing-audio-video-images-td6752.html

Further, I'd like to have the imagery in HDFS rather than on the local file
system to avoid I/O bottlenecks if possible!

Thanks for any ideas and advice!
-Patrick
