Hello,
Here is a status update on the development of the SQL driver:
1) The driver currently works for SELECT statements over DBase III
files, provided the WHERE clause (if any) is limited to simple
conditions.
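To illustrate what I mean by "simple conditions", here is a minimal sketch (not the driver's actual code; the class and method names are hypothetical) of evaluating a single field = value comparison against one record:

```java
import java.util.Map;

/** Hypothetical sketch: evaluate a simple "field = value" condition on one record. */
public class SimpleWhere {
    /** Returns true when the named field of the record equals the expected value. */
    static boolean matches(Map<String, String> record, String field, String expected) {
        return expected.equals(record.get(field));
    }

    public static void main(String[] args) {
        Map<String, String> record = Map.of("CITY", "Paris", "POP", "2148000");
        System.out.println(matches(record, "CITY", "Paris"));   // true
        System.out.println(matches(record, "CITY", "Lyon"));    // false
    }
}
```

Anything beyond this (AND/OR combinations, comparisons other than equality) is where the parsing question of point 3 begins.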
2) I am currently exercising it with real DBase III files coming from
various places, in order to challenge it.
3) Parsing of statements is now my main difficulty, and this subject
started a debate a few months ago: if I continue clause by clause
(attempting to detect a GROUP BY, a HAVING, a LIKE ... "manually"), it
will be long and difficult.
If I use a parser generator like ANTLR, it will be powerful and
complete, but this API is known to be really hard to handle and to get
working perfectly. I have used it four times, but I still fear it each
time. Yet I think it is the only solution.
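To give an idea of the "manual" clause-by-clause approach, a hypothetical sketch could split a statement on clause keywords with a regular expression; even this toy version shows why it gets fragile (keywords inside string literals, subqueries, and so on would break it):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Naive clause splitter: fine for flat statements, fragile for subqueries or quoted keywords. */
public class ClauseSplitter {
    private static final Pattern CLAUSES = Pattern.compile(
        "\\b(SELECT|FROM|WHERE|GROUP BY|HAVING|ORDER BY)\\b", Pattern.CASE_INSENSITIVE);

    /** Maps each clause keyword to the text that follows it, in order of appearance. */
    static Map<String, String> split(String sql) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher m = CLAUSES.matcher(sql);
        String current = null;
        int start = 0;
        while (m.find()) {
            if (current != null) {
                result.put(current, sql.substring(start, m.start()).trim());
            }
            current = m.group(1).toUpperCase();
            start = m.end();
        }
        if (current != null) {
            result.put(current, sql.substring(start).trim());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(split("SELECT name FROM cities WHERE pop > 1000 ORDER BY name"));
        // {SELECT=name, FROM=cities, WHERE=pop > 1000, ORDER BY=name}
    }
}
```

A real grammar (what ANTLR would generate) handles exactly the cases this sketch cannot, which is the trade-off described above.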
4) The UPDATE statement could come quickly.
5) For DELETE, I have to see whether a logical delete can be done to
avoid rewriting the whole file; for INSERT, a new entry will have to be
appended. It is not easy, because if an index file comes with the DBase III
file, I have to update it too.
I also have to find a way for the Shapefile reader to keep following
the content of the DBase file: if I delete a record in the DBase file, the
associated entry in the Shapefile should no longer be valid, for
example.
And removals or insertions in the shapefile would require changes in
its index files too.
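On the logical-delete idea: in the dBase III format, each record starts with a one-byte deletion flag, 0x20 (space) for a valid record and 0x2A ('*') for a deleted one, and the header stores the header length at bytes 8-9 and the record length at bytes 10-11 (both little-endian). A sketch under those assumptions (the class name is hypothetical, and index-file maintenance is deliberately left out):

```java
import java.io.IOException;
import java.io.RandomAccessFile;

/** Sketch: mark one record of a dBase III file as logically deleted by flipping its flag byte. */
public class DbfLogicalDelete {
    static void deleteRecord(RandomAccessFile dbf, int recordIndex) throws IOException {
        dbf.seek(8);
        int headerLength = readLittleEndianShort(dbf);   // bytes 8-9: total header size
        int recordLength = readLittleEndianShort(dbf);   // bytes 10-11: size of one record
        long flagOffset = headerLength + (long) recordIndex * recordLength;
        dbf.seek(flagOffset);
        dbf.writeByte('*');                              // 0x2A marks the record as deleted
    }

    private static int readLittleEndianShort(RandomAccessFile f) throws IOException {
        return f.read() | (f.read() << 8);
    }
}
```

The attraction is that only one byte changes on disk; the open question remains keeping the .ndx/.mdx index files and the associated Shapefile consistent afterwards.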
6) The interfaces and the abstract classes used to help the
implementation of the DBase III connection, statement, result set and
metadata will be helpful for developing another driver for another kind of
database, if needed. But these abstract classes might still change:
I expect many discoveries before the end.
7) What is the right choice after this first part of the work (the CRUD
operations): being able to handle transactions, or being able to implement
the JPA interfaces? Both are valuable goals.
Regards,
Marc Le Bihan
-----Original Message-----
From: Adam Estrada
Sent: Tuesday, November 10, 2015 7:20 PM
To: [email protected]
Cc: Mark Giaconia
Subject: Re: Long-term thoughts about big-data queries in SIS
Martin,
This is extremely cool and much needed in the geospatial community! My
company, DigitalGlobe, has done a lot with this and has open sourced
many of the packages that can be found on GitHub today. Rasdaman[1]
and PostGIS Raster are other open source examples of how to do this in
relational databases. We have done a lot of research on how to store
pixels and query for them in HBASE/Hadoop and ElasticSearch too. There
are many options for this one!
Adam
[1] http://rasdaman.org/
On Tue, Nov 10, 2015 at 6:09 AM, Martin Desruisseaux
<[email protected]> wrote:
Hello all
At the Apache Big Data conference in Budapest, I attended some
meetings about exploiting geospatial big data using the SQL language. I
thought that we could make some long-term plans that could impact the
SIS-180 ("Place a crude JDBC driver over Dbase files") work [1]. This
email is not a request for any change now. This is just a proposal about
some possible long term plans.
In one or two years, Apache SIS would hopefully have some DataStore
implementations ready for production use. But we have a strong request
for capability to use DataStores with big-data technologies like Hadoop.
This request increases the challenge of writing a SQL driver, since a
sophisticated SQL driver would need to be able to restructure query
plans according to the available clusters.
I had a discussion with people from the Apache Drill project
(https://drill.apache.org/), which already provides SQL parsing
capabilities in various big-data environments. In my understanding,
instead of writing our own SQL parser in SIS we could have the following
approach:
1. Complete the org.apache.sis.storage.DataStore API (it is currently
very minimalist).
2. Have the ShapeFile store extend the abstract SIS DataStore.
3. In a separate module, write a "SIS DataStore to Drill DataStore"
adapter. It should work for any SIS DataStore, not only the
ShapeFile one.
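To make step 3 more concrete, the adapter could look roughly like the sketch below. Both interfaces are placeholders: Drill's real storage-plugin contract has different names and signatures that I have not checked, and the SIS DataStore API of step 1 is still to be completed.

```java
import java.util.Iterator;
import java.util.Map;

/** Placeholder for the (currently minimalist) SIS DataStore API of step 1. */
interface SisDataStore {
    Iterator<Map<String, Object>> features();
}

/** Placeholder for whatever record-reading contract Drill expects from a storage plugin. */
interface DrillRecordReader {
    boolean next();
    Map<String, Object> current();
}

/** Generic "SIS DataStore to Drill" adapter: works for any SisDataStore, not only ShapeFile. */
class SisToDrillAdapter implements DrillRecordReader {
    private final Iterator<Map<String, Object>> features;
    private Map<String, Object> current;

    SisToDrillAdapter(SisDataStore store) {
        this.features = store.features();
    }

    @Override public boolean next() {
        if (!features.hasNext()) return false;
        current = features.next();
        return true;
    }

    @Override public Map<String, Object> current() {
        return current;
    }

    public static void main(String[] args) {
        SisDataStore store = () -> java.util.List.<Map<String, Object>>of(
                Map.of("NAME", "Paris")).iterator();
        SisToDrillAdapter adapter = new SisToDrillAdapter(store);
        while (adapter.next()) {
            System.out.println(adapter.current());
        }
    }
}
```

The point of the sketch is only the shape of the dependency: the adapter depends on the abstract DataStore, never on a concrete store, which is what makes it reusable beyond ShapeFile.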
In my understanding once we have a Drill DataStore implementation (I do
not know yet what is the exact name in Drill API), we should
automatically get big-data-ready SQL for any SIS DataStore. If for any
reason the Drill DataStore is considered not suitable, we could fall back
on Apache Calcite (http://calcite.apache.org/), which is the SQL parser
used under the hood by Drill. Another project that may be worth
exploring is Magellan: Geospatial Analytics on Spark [2].
My proposal can be summarized as follows: maybe in 2016 or 2017, we
could consider putting the SIS SQL support in its own module and allowing
it to run not only for ShapeFile but for any SIS DataStore, if possible
using a technology like Drill designed for big-data environments.
Any thoughts?
Martin
[1] https://issues.apache.org/jira/browse/SIS-180
[2] https://hortonworks.com/blog/magellan-geospatial-analytics-in-spark/