Hello all,

At the Apache Big Data conference in Budapest, I attended some meetings about exploiting geospatial big data using SQL. I thought that we could make some long-term plans that could impact the SIS-180 (Place a crude JDBC driver over Dbase files) work [1]. This email is not a request for any change now; it is just a proposal about some possible long-term plans.

In one or two years, Apache SIS will hopefully have some DataStore implementations ready for production use. But we have a strong demand for the capability to use DataStores with big-data technologies like Hadoop. This demand increases the challenge of writing a SQL driver, since a sophisticated SQL driver would need to be able to restructure query plans according to the available clusters.

I had a discussion with people from the Apache Drill project (https://drill.apache.org/), which already provides SQL parsing capabilities in various big-data environments. In my understanding, instead of writing our own SQL parser in SIS, we could take the following approach:

  1. Complete the org.apache.sis.storage.DataStore API (it is currently very minimalist).
  2. Have the ShapeFile store extend the abstract SIS DataStore.
  3. In a separate module, write a "SIS DataStore to Drill DataStore" adapter. It should work for any SIS DataStore, not only the ShapeFile one.

In my understanding, once we have a Drill DataStore implementation (I do not know yet what the exact name is in the Drill API), we should automatically get big-data-ready SQL for any SIS DataStore. If for any reason Drill DataStore is considered unsuitable, we could fall back on Apache Calcite (http://calcite.apache.org/), which is the SQL parser used under the hood by Drill. Another project that may be worth exploring is Magellan: Geospatial Analytics on Spark [2].

My proposal can be summarized as follows: maybe in 2016 or 2017, we could consider putting the SIS SQL support in its own module and allowing it to run not only for ShapeFile but for any SIS DataStore, if possible using a technology like Drill that is designed for big-data environments.

Any thoughts?

    Martin

[1] https://issues.apache.org/jira/browse/SIS-180
[2] https://hortonworks.com/blog/magellan-geospatial-analytics-in-spark/
