Hi Andy,

There would be a large architectural design effort if we decided to support Spark, or to replace our current internal actor system with Spark. My thinking is that the Spark DAG would be fully utilized for tracking lineage and scheduling tasks in the Spark backend, while our current actor system would continue to route operations using its own mechanisms. A lot of thought will need to go into exactly where the API splits between the Spark backend and our dedicated actor-system backend, and some harmonization will need to happen: we'd love to incorporate many of Spark's great ideas for scheduling tasks, but we also want local and high-speed use cases to avoid running through unnecessary machinery, for performance at the small scale. This is all at an early stage of consideration, so any input on design ideas is very welcome!
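To make the split concrete, here is a rough sketch of the kind of seam I have in mind. The names here (`OpBackend`, `LocalBackend`, `SparkBackend`) are invented for illustration and are not actual GeoTrellis or Spark API; the idea is just that an operation is described once against a backend trait, and the backend decides whether it runs in-process (the fast local path) or gets handed to Spark's DAG:

```scala
// Hypothetical sketch: one operation description, two execution backends.
// None of these names are real GeoTrellis or Spark APIs.

// A tiled raster, modeled minimally as tiles of int cells.
case class Tile(cells: Array[Int])

trait OpBackend {
  // Apply a per-cell function across every tile of a tiled raster.
  def mapCells(tiles: Seq[Tile])(f: Int => Int): Seq[Tile]
}

// The fast, in-process path: no scheduling machinery at all.
object LocalBackend extends OpBackend {
  def mapCells(tiles: Seq[Tile])(f: Int => Int): Seq[Tile] =
    tiles.map(t => Tile(t.cells.map(f)))
}

// A Spark-backed implementation would implement the same trait, but
// parallelize the tiles into an RDD and let Spark's DAG track lineage:
//
//   class SparkBackend(sc: SparkContext) extends OpBackend {
//     def mapCells(tiles: Seq[Tile])(f: Int => Int): Seq[Tile] =
//       sc.parallelize(tiles).map(t => Tile(t.cells.map(f))).collect().toSeq
//   }

object BackendDemo extends App {
  val raster  = Seq(Tile(Array(1, 2, 3)), Tile(Array(4, 5, 6)))
  val doubled = LocalBackend.mapCells(raster)(_ * 2)
  println(doubled.map(_.cells.toList)) // List(List(2, 4, 6), List(8, 10, 12))
}
```

The point of the seam is exactly the performance concern above: `LocalBackend` keeps small, web-time workloads free of distributed machinery, while the Spark path gets lineage and scheduling for free.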
The aim from the start of a Spark support story would be for all GeoTrellis operations that currently support distribution over tiled rasters to also be supported in the Spark environment, so Map Algebra operations like classification would be carried over as a first step. As for feature extraction and pyramid generation, these are operations GeoTrellis does not currently have (beyond basic vectorization capabilities), since our focus has been on implementing fast Map Algebra operations, but they would certainly be great additions to any geospatial data analysis library.

Thanks for your ideas, and looking forward to your participation.

Cheers,
Rob

On Thu, Nov 7, 2013 at 3:05 PM, andy petrella <andy.petre...@gmail.com> wrote:

> Hello Rob,
>
> As you may know, I have long experience with geospatial data, and I'm now
> investigating Spark... so I'll be very interested in further answers, but also
> in participating in moving this great idea forward!
>
> For instance, I'd say that implementing classical geospatial algorithms
> like classification, feature extraction, pyramid generation, and so on would
> make a geo-extension lib for Spark; this would be easier using the GeoTrellis API.
>
> My only question, for now, is that GeoTrellis has its own notion of
> lineage and Spark does as well, so maybe some harmonization work will have to be
> done to serialize and schedule them? Maybe Pickles could help with the
> serialization part...
>
> Sorry if I missed something (or even said something silly ^^)... I'm going now
> to the thread you mentioned!
>
> Looking forward ;)
>
> Cheers,
> andy
>
>
> On Thu, Nov 7, 2013 at 8:49 PM, Rob Emanuele <lossy...@gmail.com> wrote:
>
>> Hello,
>>
>> I'm a developer on the GeoTrellis project (http://geotrellis.github.io).
>> We do fast raster processing over large data sets, from web-time
>> (sub-100ms) processing for live endpoints to distributed raster analysis
>> over clusters using Akka clustering.
>>
>> There's currently a discussion underway about moving to support a Spark
>> backend for doing large-scale distributed raster analysis. You can see the
>> discussion here:
>> https://groups.google.com/forum/#!topic/geotrellis-user/wkUOhFwYAvc. Any
>> contributions to the discussion would be welcome.
>>
>> My question to the list is: is there currently any development toward a
>> geospatial data story for Spark, that is, using Spark for large-scale
>> raster/vector spatial data analysis? Is there anyone using Spark currently
>> for this sort of work?
>>
>> Thanks,
>> Rob Emanuele

--
Rob Emanuele, GIS Software Engineer
Azavea | 340 N 12th St, Ste 402, Philadelphia, PA
remanu...@azavea.com | T 215.701.7692 | F 215.925.2663
Web azavea.com <http://www.azavea.com/> | Blog azavea.com/blogs <http://www.azavea.com/Blogs> | Twitter @azavea <http://twitter.com/azavea>
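As a footnote on the classification example mentioned above: a Map Algebra classification (reclassify) is a per-cell operation, which makes it trivially data-parallel over tiles and a natural first candidate for a Spark backend. The sketch below is purely illustrative, with invented names, and uses a plain `Seq` standing in for an RDD of tiles; under Spark the same per-tile function would simply be mapped over the distributed collection:

```scala
// Illustrative sketch only: classification as a per-cell reclassify.
// A plain Seq stands in for a distributed collection of tiles; under
// Spark the same per-tile function would run over an RDD of tiles.
case class Tile(cells: Array[Int])

// Classify each cell into a class id via threshold breaks.
// breaks = Seq((upperBound, classId)), checked in order.
def classify(tile: Tile, breaks: Seq[(Int, Int)], default: Int): Tile =
  Tile(tile.cells.map { v =>
    breaks.collectFirst { case (bound, cls) if v <= bound => cls }
          .getOrElse(default)
  })

object ClassifyDemo extends App {
  val breaks = Seq((10, 1), (100, 2)) // <=10 -> class 1, <=100 -> class 2
  val tiles  = Seq(Tile(Array(3, 50, 999)))
  val out    = tiles.map(classify(_, breaks, default = 3))
  println(out.head.cells.toList) // List(1, 2, 3)
}
```

Because each tile is classified independently, the operation needs no shuffles, so the distributed version inherits Spark's lineage and scheduling essentially for free.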