Hello,
Here is a status update on the development of the SQL driver:
1) The driver currently works for SELECT statements over DBase III
files, provided the WHERE clause (if any) is limited to simple
conditions.
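To illustrate what I mean by "simple conditions", here is a minimal sketch (not the driver's actual code; the class and method names are hypothetical) of evaluating a single field = value comparison against one record:

```java
import java.util.Map;

/** Hypothetical sketch: evaluate a simple "field = value" condition on one record. */
public class SimpleWhere {
    /** Returns true when the named field of the record equals the expected value. */
    static boolean matches(Map<String, String> record, String field, String expected) {
        return expected.equals(record.get(field));
    }

    public static void main(String[] args) {
        Map<String, String> record = Map.of("CITY", "Paris", "POP", "2148000");
        System.out.println(matches(record, "CITY", "Paris"));   // true
        System.out.println(matches(record, "CITY", "Lyon"));    // false
    }
}
```

Anything beyond this (AND/OR combinations, comparisons other than equality) is where the parsing question of point 3 begins.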
2) I am currently exercising it with real DBase III files coming from
various places, in order to challenge it.
3) Parsing of statements is now my main difficulty, and this subject
started a debate a few months ago: if I continue clause by clause
(attempting to detect a GROUP BY, a HAVING, a LIKE ... "manually"), it
will be long and difficult.
If I use a parser generator like ANTLR, it will be powerful and
complete, but this API is known to be really hard to handle and to get
working perfectly. I have used it four times, but I still fear it each
time. Yet I think it is the only solution.
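To give an idea of the "manual" clause-by-clause approach, a hypothetical sketch could split a statement on clause keywords with a regular expression; even this toy version shows why it gets fragile (keywords inside string literals, subqueries, and so on would break it):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Naive clause splitter: fine for flat statements, fragile for subqueries or quoted keywords. */
public class ClauseSplitter {
    private static final Pattern CLAUSES = Pattern.compile(
        "\\b(SELECT|FROM|WHERE|GROUP BY|HAVING|ORDER BY)\\b", Pattern.CASE_INSENSITIVE);

    /** Maps each clause keyword to the text that follows it, in order of appearance. */
    static Map<String, String> split(String sql) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher m = CLAUSES.matcher(sql);
        String current = null;
        int start = 0;
        while (m.find()) {
            if (current != null) {
                result.put(current, sql.substring(start, m.start()).trim());
            }
            current = m.group(1).toUpperCase();
            start = m.end();
        }
        if (current != null) {
            result.put(current, sql.substring(start).trim());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(split("SELECT name FROM cities WHERE pop > 1000 ORDER BY name"));
        // {SELECT=name, FROM=cities, WHERE=pop > 1000, ORDER BY=name}
    }
}
```

A real grammar (what ANTLR would generate) handles exactly the cases this sketch cannot, which is the trade-off described above.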
4) The UPDATE statement could come quickly.
5) For DELETE, I have to see whether a logical delete can be done to
avoid rewriting the whole file; for INSERT, a new entry will have to be
appended. It is not easy, because if an index file comes with the DBase III
file, I have to update it too.
I also have to find a way for the Shapefile reader to keep following
the content of the DBase file: if I delete a record in the DBase file, the
associated entry in the Shapefile should no longer be valid, for
example.
And removals or insertions in the shapefile would require changes in
its index files too.
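On the logical-delete idea: in the dBase III format, each record starts with a one-byte deletion flag, 0x20 (space) for a valid record and 0x2A ('*') for a deleted one, and the header stores the header length at bytes 8-9 and the record length at bytes 10-11 (both little-endian). A sketch under those assumptions (the class name is hypothetical, and index-file maintenance is deliberately left out):

```java
import java.io.IOException;
import java.io.RandomAccessFile;

/** Sketch: mark one record of a dBase III file as logically deleted by flipping its flag byte. */
public class DbfLogicalDelete {
    static void deleteRecord(RandomAccessFile dbf, int recordIndex) throws IOException {
        dbf.seek(8);
        int headerLength = readLittleEndianShort(dbf);   // bytes 8-9: total header size
        int recordLength = readLittleEndianShort(dbf);   // bytes 10-11: size of one record
        long flagOffset = headerLength + (long) recordIndex * recordLength;
        dbf.seek(flagOffset);
        dbf.writeByte('*');                              // 0x2A marks the record as deleted
    }

    private static int readLittleEndianShort(RandomAccessFile f) throws IOException {
        return f.read() | (f.read() << 8);
    }
}
```

The attraction is that only one byte changes on disk; the open question remains keeping the .ndx/.mdx index files and the associated Shapefile consistent afterwards.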
6) The interfaces and the abstract classes used to help the
implementation of the DBase III connection, statement, result set and
metadata will be helpful for developing another driver for another kind of
database, if needed. But these abstract classes might still change:
I expect many discoveries before the end.
7) What is the right choice after this first part of the work (the CRUD
operations): being able to handle transactions, or being able to implement
the JPA interfaces? Both are valuable goals.
Regards,
Marc Le Bihan
-----Original Message-----
From: Adam Estrada
Sent: Tuesday, November 10, 2015 7:20 PM
To: [email protected]
Cc: Mark Giaconia
Subject: Re: Long-term thoughts about big-data queries in SIS
Martin,
This is extremely cool and much needed in the geospatial community! My
company, DigitalGlobe, has done a lot with this and has open sourced
many of the packages that can be found on GitHub today. Rasdaman[1]
and PostGIS Raster are other open source examples of how to do this in
relational databases. We have done a lot of research on how to store
pixels and query for them in HBASE/Hadoop and ElasticSearch too. There
are many options for this one!
Adam
[1] http://rasdaman.org/
On Tue, Nov 10, 2015 at 6:09 AM, Martin Desruisseaux
<[email protected]> wrote:
Hello all
At the Apache Big Data conference in Budapest, I attended some
meetings about exploiting geospatial big data using the SQL language. I
thought that we could make some long-term plans that could impact the
SIS-180 ("Place a crude JDBC driver over Dbase files") work [1]. This
email is not a request for any change now. This is just a proposal about
some possible long term plans.
In one or two years, Apache SIS would hopefully have some DataStore
implementations ready for production use. But we have a strong request
for capability to use DataStores with big-data technologies like Hadoop.
This request increases the challenge of writing a SQL driver, since a
sophisticated SQL driver would need to be able to restructure query
plans according to the available clusters.
I had a discussion with people from the Apache Drill project
(https://drill.apache.org/), which already provides SQL parsing
capabilities in various big-data environments. In my understanding,
instead of writing our own SQL parser in SIS we could have the following
approach:
1. Complete the org.apache.sis.storage.DataStore API (it is currently
very minimalist).
2. Have the ShapeFile store extend the abstract SIS DataStore.
3. In a separate module, write a "SIS DataStore to Drill DataStore"
adapter. It should work for any SIS DataStore, not only the
ShapeFile one.
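To make step 3 more concrete, the adapter could look roughly like the sketch below. Both interfaces are placeholders: Drill's real storage-plugin contract has different names and signatures that I have not checked, and the SIS DataStore API of step 1 is still to be completed.

```java
import java.util.Iterator;
import java.util.Map;

/** Placeholder for the (currently minimalist) SIS DataStore API of step 1. */
interface SisDataStore {
    Iterator<Map<String, Object>> features();
}

/** Placeholder for whatever record-reading contract Drill expects from a storage plugin. */
interface DrillRecordReader {
    boolean next();
    Map<String, Object> current();
}

/** Generic "SIS DataStore to Drill" adapter: works for any SisDataStore, not only ShapeFile. */
class SisToDrillAdapter implements DrillRecordReader {
    private final Iterator<Map<String, Object>> features;
    private Map<String, Object> current;

    SisToDrillAdapter(SisDataStore store) {
        this.features = store.features();
    }

    @Override public boolean next() {
        if (!features.hasNext()) return false;
        current = features.next();
        return true;
    }

    @Override public Map<String, Object> current() {
        return current;
    }

    public static void main(String[] args) {
        SisDataStore store = () -> java.util.List.<Map<String, Object>>of(
                Map.of("NAME", "Paris")).iterator();
        SisToDrillAdapter adapter = new SisToDrillAdapter(store);
        while (adapter.next()) {
            System.out.println(adapter.current());
        }
    }
}
```

The point of the sketch is only the shape of the dependency: the adapter depends on the abstract DataStore, never on a concrete store, which is what makes it reusable beyond ShapeFile.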
In my understanding once we have a Drill DataStore implementation (I do
not know yet what is the exact name in Drill API), we should
automatically get big-data-ready SQL for any SIS DataStore. If for any
reason the Drill DataStore is considered not suitable, we could fall back
on Apache Calcite (http://calcite.apache.org/), which is the SQL parser
used under the hood by Drill. Another project that may be worth
exploring is Magellan: Geospatial Analytics on Spark [2].
My proposal can be summarized as follows: maybe in 2016 or 2017, we
could consider putting the SIS SQL support in its own module and allowing
it to run not only for ShapeFile but for any SIS DataStore, if possible
using a technology like Drill designed for big-data environments.
Any thoughts?
Martin
[1] https://issues.apache.org/jira/browse/SIS-180
[2] https://hortonworks.com/blog/magellan-geospatial-analytics-in-spark/