Re: Processing Large Images in Spark?

2015-04-07 Thread Steve Loughran

On 6 Apr 2015, at 23:05, Patrick Young <patrick.mckendree.yo...@gmail.com> wrote:

does anyone have any thoughts on storing a really large raster in HDFS?  Seems
like if I just dump the image into HDFS as is, it'll get stored in blocks all
across the system and when I go to read it, there will be a ton of network
traffic from all the blocks to the reading node!

It gets split into blocks scattered (at the default 3x replication) as: 1 on the
current host, 2 elsewhere.
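
If you want to see where those blocks actually end up, something like this
(just a sketch; the path is a placeholder and it assumes the hdfs CLI is on
the PATH) asks the NameNode for the block-to-datanode mapping:

    import subprocess

    # Print which datanodes hold each block of the file; /data/mosaic.tif is
    # a made-up path.
    report = subprocess.check_output(
        ["hdfs", "fsck", "/data/mosaic.tif",
         "-files", "-blocks", "-locations"])
    print(report.decode())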

I'd recommend you look at Russell Perry's 2009 HP Labs paper, High Speed
Raster Image Streaming For Digital Presses Using the Hadoop File System, which
was about using HDFS/MapReduce to render images rather than analyse them. It's
similar to things like tile generation for google/open-street/apple maps:

http://www.hpl.hp.com/techreports/2009/HPL-2009-345.pdf

Russ modified the HDFS client so that, rather than having it pick a block for
the app to read from, the app got to make the decision itself. Code running on
the server hooked straight up to the line-rate printing press then fetched data
from different racks so as to get maximum bandwidth out of each host and each
rack switch; 4 Gb/s overall. I don't think that patch was ever contributed
back, or at least it never got in.



Re: Processing Large Images in Spark?

2015-04-07 Thread andy petrella
Heya,

You might be interested in looking at GeoTrellis.
They use RDDs of Tiles to process big images like Landsat ones (especially
Landsat 8).

However, I see you have only 1 GB per file, so I guess you only care about a
single band? Or is it a reboxed pic?

Note: I think the GeoTrellis image format is still single-band, although it's
highly optimized for distributed geoprocessing.

my2¢
andy


On Tue, Apr 7, 2015 at 12:06 AM Patrick Young 
patrick.mckendree.yo...@gmail.com wrote:

 Hi all,

 I'm new to Spark and wondering if it's appropriate to use for some image
 processing tasks on pretty sizable (~1 GB) images.

 Here is an example use case.  Amazon recently put the entire Landsat8
 archive in S3:

 http://aws.amazon.com/public-data-sets/landsat/

 I have a bunch of GDAL-based Python scripts (GDAL is a C library for
 geospatial raster I/O) that take a collection of Landsat images and mash them
 into a single mosaic.  This works great for little mosaics, but if I wanted to
 do the entire world, I need more horsepower!  The scripts do the following:

1. Copy the selected rasters down from S3 to the local file system
2. Read each image into memory as numpy arrays (a big 3D array), do
some image processing using various Python libs, and write the result out
to the local file system
3. Blast all the processed imagery back to S3, and hook up MapServer
for viewing

 Step 2 takes a long time; this is what I'd like to leverage Spark for.
 Each image, if you stack all the bands together, can be ~1 GB in size.

 So here are a couple of questions:


1. If I have a large image/array, what's a good way of getting it into
an RDD?  I've seen some stuff about folks tiling up imagery into little
chunks and storing it in HBase.  I imagine I would want an image chunk in
each partition of the RDD.  If I wanted to apply something like a gaussian
filter I'd need each chunk to overlap a bit.
2. In a similar vein, does anyone have any thoughts on storing a
really large raster in HDFS?  Seems like if I just dump the image into HDFS
as is, it'll get stored in blocks all across the system and when I go to
read it, there will be a ton of network traffic from all the blocks to the
reading node!
3. How is numpy's ndarray support in Spark?  For instance, if I do
a map on my theoretical chunked image RDD, can I easily realize the image
chunk as a numpy array inside the function?  Most of the Python algorithms
I use take in and return a numpy array.

 I saw some discussion in the past on image processing:

 These threads talk about processing lots of little images, but this isn't
 really my situation as I've got one very large image:


 http://apache-spark-user-list.1001560.n3.nabble.com/Better-way-to-process-large-image-data-set-td14533.html

 http://apache-spark-user-list.1001560.n3.nabble.com/Processing-audio-video-images-td6752.html

 Further, I'd like to have the imagery in HDFS rather than on the local file
 system to avoid I/O bottlenecks if possible!

 Thanks for any ideas and advice!
 -Patrick





Processing Large Images in Spark?

2015-04-06 Thread Patrick Young
Hi all,

I'm new to Spark and wondering if it's appropriate to use for some image
processing tasks on pretty sizable (~1 GB) images.

Here is an example use case.  Amazon recently put the entire Landsat8
archive in S3:

http://aws.amazon.com/public-data-sets/landsat/

I have a bunch of GDAL-based Python scripts (GDAL is a C library for
geospatial raster I/O) that take a collection of Landsat images and mash them
into a single mosaic.  This works great for little mosaics, but if I wanted to
do the entire world, I need more horsepower!  The scripts do the following:

   1. Copy the selected rasters down from S3 to the local file system
   2. Read each image into memory as numpy arrays (a big 3D array), do some
   image processing using various Python libs, and write the result out to the
   local file system
   3. Blast all the processed imagery back to S3, and hook up MapServer
   for viewing

Step 2 takes a long time; this is what I'd like to leverage Spark for.
Each image, if you stack all the bands together, can be ~1 GB in size.
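
For concreteness, a minimal sketch of that read step using the GDAL Python
bindings (the path is a placeholder, and the float32 cast is just one option):

    import numpy as np
    from osgeo import gdal

    # Open a band-stacked Landsat scene and pull everything into a single
    # 3D numpy array of shape (bands, rows, cols); ~1 GB for a full scene.
    ds = gdal.Open("/tmp/landsat_scene_stacked.tif")
    image = ds.ReadAsArray().astype(np.float32)

    # ... per-pixel / per-band processing with numpy here ...
    print(image.shape)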

So here are a couple of questions:


   1. If I have a large image/array, what's a good way of getting it into
   an RDD?  I've seen some stuff about folks tiling up imagery into little
   chunks and storing it in HBase.  I imagine I would want an image chunk in
   each partition of the RDD.  If I wanted to apply something like a gaussian
   filter I'd need each chunk to overlap a bit (a rough sketch of what I'm
   picturing follows this list).
   2. In a similar vein, does anyone have any thoughts on storing a really
   large raster in HDFS?  Seems like if I just dump the image into HDFS as is,
   it'll get stored in blocks all across the system and when I go to read it,
   there will be a ton of network traffic from all the blocks to the reading
   node!
   3. How is numpy's ndarray support in Spark?  For instance, if I do a
   map on my theoretical chunked image RDD, can I easily realize the image
   chunk as a numpy array inside the function?  Most of the Python algorithms
   I use take in and return a numpy array.
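
To make questions 1 and 3 concrete, here's the kind of thing I'm picturing.
It's purely a sketch: a random array stands in for the mosaic, the tile and
overlap sizes are made up, and it assumes scipy is installed on the workers:

    import numpy as np
    from scipy.ndimage import gaussian_filter
    from pyspark import SparkContext

    sc = SparkContext(appName="tiled-gaussian-sketch")

    # Stand-in for one band of the big mosaic, already on the driver.
    image = np.random.rand(4000, 4000).astype(np.float32)
    tile, overlap = 1000, 16  # tile edge and halo width, both made up

    def split_into_tiles(img):
        """Yield ((row, col), chunk) where each chunk carries an overlap halo."""
        rows, cols = img.shape
        for r in range(0, rows, tile):
            for c in range(0, cols, tile):
                r0, c0 = max(r - overlap, 0), max(c - overlap, 0)
                r1 = min(r + tile + overlap, rows)
                c1 = min(c + tile + overlap, cols)
                yield (r, c), img[r0:r1, c0:c1]

    def smooth(kv):
        """Inside the map, the chunk really is a plain numpy array."""
        (r, c), chunk = kv
        return (r, c), gaussian_filter(chunk, sigma=2)

    # One chunk per partition so each task handles exactly one tile; a real
    # reassembly step would trim the halo before stitching tiles back together.
    tiles = list(split_into_tiles(image))
    smoothed = sc.parallelize(tiles, len(tiles)).map(smooth).collect()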

I saw some discussion in the past on image processing:

These threads talk about processing lots of little images, but this isn't
really my situation as I've got one very large image:

http://apache-spark-user-list.1001560.n3.nabble.com/Better-way-to-process-large-image-data-set-td14533.html
http://apache-spark-user-list.1001560.n3.nabble.com/Processing-audio-video-images-td6752.html

Further, I'd like to have the imagery in HDFS rather than on the local file
system to avoid I/O bottlenecks if possible!

Thanks for any ideas and advice!
-Patrick

