Long-term thoughts about big-data queries in SIS

Martin Desruisseaux Tue, 10 Nov 2015 03:10:23 -0800

Hello all

At the Apache Big Data conference in Budapest, I attended some
meetings about exploiting geospatial big data using the SQL language. I
thought that we could make some long-term plans that could impact the
SIS-180 (Place a crude JDBC driver over Dbase files) work [1]. This
email is not a request for any change now. It is just a proposal about
some possible long-term plans.


In one or two years, Apache SIS will hopefully have some DataStore
implementations ready for production use. But we have a strong request
for the capability to use DataStores with big-data technologies like
Hadoop. This request increases the challenge of writing a SQL driver,
since a sophisticated SQL driver would need to be able to restructure
query plans according to the available clusters.

I had a discussion with people from the Apache Drill project
(https://drill.apache.org/), which already provides SQL parsing
capabilities in various big-data environments. In my understanding,
instead of writing our own SQL parser in SIS, we could take the
following approach:

 1. Complete the org.apache.sis.storage.DataStore API (it is currently
    very minimalist).
 2. Have the ShapeFile store extend the abstract SIS DataStore.
 3. In a separate module, write a "SIS DataStore to Drill DataStore"
    adapter. It should work for any SIS DataStore, not only the
    ShapeFile one.
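
To make step 3 more concrete, here is a minimal sketch of the adapter
idea in plain Java. Everything in it is hypothetical: FeatureStore
stands in for a completed SIS DataStore API, RowSource stands in for
whatever interface the SQL engine expects to scan, and neither is a
real SIS or Drill type. The point is only that the adapter is written
against the abstract store, so it would work for any implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class StoreAdapterSketch {
    /** Hypothetical stand-in for a completed SIS DataStore API (not a real SIS type). */
    interface FeatureStore {
        List<String> attributeNames();
        Iterator<Map<String, Object>> features();
    }

    /** Hypothetical stand-in for the table abstraction a SQL engine would scan (not a real Drill type). */
    interface RowSource {
        List<String> columns();
        Iterator<Object[]> rows();
    }

    /** Adapts any FeatureStore to a RowSource, one row per feature, converted lazily. */
    static RowSource adapt(final FeatureStore store) {
        final List<String> columns = store.attributeNames();
        return new RowSource() {
            @Override public List<String> columns() {
                return columns;
            }
            @Override public Iterator<Object[]> rows() {
                final Iterator<Map<String, Object>> features = store.features();
                return new Iterator<Object[]>() {
                    @Override public boolean hasNext() {
                        return features.hasNext();
                    }
                    @Override public Object[] next() {
                        // Order the attribute values by column position.
                        final Map<String, Object> feature = features.next();
                        final Object[] row = new Object[columns.size()];
                        for (int i = 0; i < row.length; i++) {
                            row[i] = feature.get(columns.get(i));
                        }
                        return row;
                    }
                };
            }
        };
    }

    /** A tiny in-memory store with made-up values, used only to exercise the adapter. */
    static FeatureStore sampleStore() {
        final List<Map<String, Object>> data = new ArrayList<>();
        final Map<String, Object> first = new LinkedHashMap<>();
        first.put("name", "a");
        first.put("value", 1);
        data.add(first);
        final Map<String, Object> second = new LinkedHashMap<>();
        second.put("name", "b");
        second.put("value", 2);
        data.add(second);
        return new FeatureStore() {
            @Override public List<String> attributeNames() {
                return Arrays.asList("name", "value");
            }
            @Override public Iterator<Map<String, Object>> features() {
                return data.iterator();
            }
        };
    }

    public static void main(String[] args) {
        final RowSource table = adapt(sampleStore());
        System.out.println(table.columns());
        for (Iterator<Object[]> it = table.rows(); it.hasNext();) {
            System.out.println(Arrays.toString(it.next()));
        }
    }
}
```

A real adapter would also have to expose attribute types and let the
engine push filters down to the store. The sketch leaves both out.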

In my understanding, once we have a Drill DataStore implementation (I do
not yet know the exact name in the Drill API), we should automatically
get big-data-ready SQL for any SIS DataStore. If for any reason the
Drill DataStore is considered unsuitable, we could fall back on
Apache Calcite (http://calcite.apache.org/), which is the SQL parser
used under the hood by Drill. Another project that may be worth
exploring is Magellan: Geospatial Analytics on Spark [2].

My proposal can be summarized as follows: maybe in 2016 or 2017, we
could consider putting the SIS SQL support in its own module and
allowing it to run not only for ShapeFile, but for any SIS DataStore, if
possible using a technology like Drill designed for big-data
environments.

Any thoughts?

    Martin


[1] https://issues.apache.org/jira/browse/SIS-180
[2] https://hortonworks.com/blog/magellan-geospatial-analytics-in-spark/
