Re: Processing Large Images in Spark?
On 6 Apr 2015, at 23:05, Patrick Young <patrick.mckendree.yo...@gmail.com> wrote:

> does anyone have any thoughts on storing a really large raster in HDFS? Seems like if I just dump the image into HDFS as is, it'll get stored in blocks all across the system, and when I go to read it there will be a ton of network traffic from all the blocks to the reading node!

It gets split into blocks scattered (at the default 3x replication) as: one replica on the current host, two elsewhere. I'd recommend you look at Russell Perry's 2009 HP Labs paper, "High Speed Raster Image Streaming For Digital Presses Using the Hadoop File System", which was about using HDFS/MapReduce to render images rather than analyse them. It's similar to things like tile generation for Google/OpenStreetMap/Apple maps: http://www.hpl.hp.com/techreports/2009/HPL-2009-345.pdf

Russ modified the HDFS client so that, rather than having it pick a block for the application to read from, the application got to make the decision itself. Code running on the server, hooked straight up to the line-rate printing press, then fetched data from different racks so as to get maximum bandwidth out of each host and each rack switch: 4 Gb/s overall. I don't think that patch was ever contributed back, or at least it never got in.
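For reference, the "1 on current host, 2 elsewhere" behaviour is HDFS's default rack-aware placement: first replica on the writer's node, second on a node in a different rack, third on another node in that same remote rack. A toy simulation of that policy (illustrative only, not the actual HDFS code; node and rack names are made up):

```python
import random

def default_placement(writer, topology):
    """Toy model of HDFS default 3x replica placement:
    replica 1 on the writer's node, replica 2 on a node in a
    different rack, replica 3 on another node in that remote rack."""
    local_rack = topology[writer]
    # candidates in any rack other than the writer's
    remote = [n for n, r in topology.items() if r != local_rack]
    second = random.choice(remote)
    # third replica: same rack as the second, different node
    third_candidates = [n for n, r in topology.items()
                        if r == topology[second] and n != second]
    third = random.choice(third_candidates)
    return [writer, second, third]

# node -> rack
topology = {"n1": "r1", "n2": "r1", "n3": "r2", "n4": "r2"}
replicas = default_placement("n1", topology)
```

This is why a single sequential reader of a huge file sees traffic from all over the cluster, and why Perry's approach of letting the application choose replicas per block could spread reads across racks for aggregate bandwidth.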
Re: Processing Large Images in Spark?
Heya,

You might be interested in looking at GeoTrellis. They use RDDs of Tiles to process big images such as Landsat scenes (especially Landsat 8). However, I see you have only ~1 GB per file, so I guess you only care about a single band? Or is it a reboxed pic? Note: I think the GeoTrellis image format is still single-band, although it's highly optimized for distributed geoprocessing.

my 2¢,
andy

On Tue, Apr 7, 2015 at 12:06 AM Patrick Young <patrick.mckendree.yo...@gmail.com> wrote:

> Hi all, I'm new to Spark and wondering if it's appropriate to use for some image processing tasks on pretty sizable (~1 GB) images. [...]
Processing Large Images in Spark?
Hi all,

I'm new to Spark and wondering if it's appropriate to use for some image processing tasks on pretty sizable (~1 GB) images. Here is an example use case. Amazon recently put the entire Landsat 8 archive in S3: http://aws.amazon.com/public-data-sets/landsat/

I have a bunch of GDAL-based (a C library for geospatial raster I/O) Python scripts that take a collection of Landsat images and mash them into a single mosaic. This works great for little mosaics, but if I wanted to do the entire world, I need more horsepower! The scripts do the following:

1. Copy the selected rasters down from S3 to the local file system.
2. Read each image into memory as numpy arrays (a big 3D array), do some image processing using various Python libs, and write the result out to the local file system.
3. Blast all the processed imagery back to S3, and hook up MapServer for viewing.

Step 2 takes a long time; this is what I'd like to leverage Spark for. Each image, if you stack all the bands together, can be ~1 GB in size. So here are a couple of questions:

1. If I have a large image/array, what's a good way of getting it into an RDD? I've seen some stuff about folks tiling up imagery into little chunks and storing it in HBase. I imagine I would want an image chunk in each partition of the RDD. If I wanted to apply something like a Gaussian filter, I'd need each chunk to overlap a bit.
2. In a similar vein, does anyone have any thoughts on storing a really large raster in HDFS? Seems like if I just dump the image into HDFS as is, it'll get stored in blocks all across the system, and when I go to read it there will be a ton of network traffic from all the blocks to the reading node!
3. How is numpy ndarray support in Spark? For instance, if I do a map on my theoretical chunked image RDD, can I easily realize the image chunk as a numpy array inside the function? Most of the Python algorithms I use take in and return a numpy array.
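The overlapping-chunk idea in question 1 can be sketched in plain numpy: split the raster into row bands, pad each band with a halo of neighbouring rows at least as wide as the filter radius, filter, then trim the halo back off. A minimal sketch (the helper names are made up; assumes scipy for the filter):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def split_with_halo(img, chunk_rows, halo):
    """Split a 2D array into row bands, each padded with up to `halo`
    extra rows from its neighbours so filters stay valid at the seams."""
    chunks = []
    for start in range(0, img.shape[0], chunk_rows):
        lo = max(0, start - halo)
        hi = min(img.shape[0], start + chunk_rows + halo)
        # record how much halo was actually taken above this band
        chunks.append((start, img[lo:hi], start - lo))
    return chunks

def filter_and_trim(chunk, chunk_rows, sigma):
    start, data, top_halo = chunk
    filtered = gaussian_filter(data, sigma=sigma)
    # keep only the band's own rows, dropping the halo
    return start, filtered[top_halo:top_halo + chunk_rows]

img = np.random.rand(100, 80)
# sigma=2.0 with scipy's default truncate=4.0 means a kernel radius
# of 8 rows, so a halo of 8 is enough for exact seams
bands = [filter_and_trim(c, 25, 2.0) for c in split_with_halo(img, 25, 8)]
result = np.vstack([band for _, band in bands])
```

In Spark, each `(start, data, top_halo)` tuple would be one RDD element and `filter_and_trim` the map function; the halo makes the chunks independent, so no shuffle is needed during filtering.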
I saw some discussion in the past on image processing. These threads talk about processing lots of little images, but that isn't really my situation, as I've got one very large image:

http://apache-spark-user-list.1001560.n3.nabble.com/Better-way-to-process-large-image-data-set-td14533.html
http://apache-spark-user-list.1001560.n3.nabble.com/Processing-audio-video-images-td6752.html

Further, I'd like to have the imagery in HDFS rather than on the local file system to avoid I/O bottlenecks if possible!

Thanks for any ideas and advice!

-Patrick
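On question 3: PySpark pickles the map function and runs it in Python worker processes, so numpy arrays work naturally inside a `map`. The three steps above can be sketched as one per-scene function, run locally with a list comprehension or distributed unchanged with `sc.parallelize(...).map(...)`. The scene IDs are hypothetical and the S3/GDAL calls are stubbed:

```python
import numpy as np

def process_scene(scene_id):
    """Hypothetical per-scene work, stand-ins for steps 1-3."""
    # 1. fetch the raster for scene_id from S3 (stubbed with random data)
    bands = np.random.rand(8, 256, 256)   # (band, row, col)
    # 2. the numpy step, e.g. a per-pixel composite across bands
    composite = bands.mean(axis=0)
    # 3. write the result back to S3 (stubbed: return a summary instead)
    return scene_id, composite.shape

scene_ids = ["scene_a", "scene_b"]

# Locally this is an ordinary map; on a cluster the same function can
# be shipped to executors with, e.g.:
#   sc.parallelize(scene_ids, numSlices=len(scene_ids)) \
#     .map(process_scene).collect()
results = [process_scene(s) for s in scene_ids]
```

This parallelizes across scenes (step 2's bottleneck) without tiling; tiling into sub-image chunks only becomes necessary when a single scene no longer fits in one executor's memory.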
--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Processing-Large-Images-in-Spark-tp22397.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.