Re: [Qgis-developer] QEP/RFC sqlite virtual tables
Hi Martin,

On 08/11/2014 07:06, Martin Dobias wrote:
>> * About indexes on virtual tables, contrary to what I wrote previously, the xBestIndex() method of virtual tables should be enough to orient the planner: an estimated cost and an estimated number of rows can be returned for each part of the WHERE clause. So there should be no need to copy native indexes. But the provider interface should be extended in order to provide such statistics.
>
> I am trying to figure out how things would work... How would you decide whether to copy the native index of a virtual table? (Always?) E.g. a pkey, at least in Postgres, is also an index - shall we always copy it? When would you make copies of native indexes? (When constructing the provider? On every getFeatures() call?) How would you keep the index up to date? (The table may change outside of QGIS.) What I want to understand is how SQL SELECT statements would execute - like the one with the join filter you need:
>
> SELECT * FROM tblA LEFT JOIN tblB ON tblA.X = tblB.Y WHERE tblB.Z = 42

The idea is precisely to avoid copying native indexes. With this example query, the virtual table implementations of tblA and tblB will each have xBestIndex() called twice: once without constraints and once with these constraints:

* X = ? on tblA
* Y = ? AND Z = ? on tblB

If you know there is a native index on tblA(X), then you return a cost for this constraint that is lower than the cost without the constraint (a sequential scan). The planner will decide what to do based on these costs. Some discussion here: http://osdir.com/ml/sqlite-dev/2014-11/msg3.html

For spatial indexes this is still possible (avoiding copies of native indexes), but it would require introducing some new syntax to translate the spatial predicate into regular comparison operators. See here: https://www.mail-archive.com/sqlite-users@sqlite.org/msg87191.html ... so for spatial indexes some additional SQL must be generated by QGIS (but that is already the case if we want to avoid the Spatialite index syntax).

>> We are willing to develop this as a plugin if it can be included in QGIS as a C++ plugin. Is there any objection to this?
>
> Personally I do not see much difference between having some functionality in 'core' and having some functionality in a C++ plugin (included in the QGIS tree) - in both cases it requires us to maintain such code once it is added... Why not develop a prototype that does not require QGIS code - I guess the only QGIS-specific part is the virtual table implementation. Such a prototype could eventually evolve into a standalone library providing an SQL parser / execution engine with a nice interface...

Ok. It means the optimization part (accessing costs of the native indexes, for example, as discussed above) cannot be demonstrated yet with such an external prototype, because it would require modifying the providers' interface in core. But it does not prevent the plugin / library from taking them into account.

___
Qgis-developer mailing list
Qgis-developer@lists.osgeo.org
http://lists.osgeo.org/mailman/listinfo/qgis-developer
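[Editor's note] The cost-based choice discussed above can be sketched with plain SQLite (not the proposed QGIS virtual-table provider): for native tables, EXPLAIN QUERY PLAN shows the planner preferring an index SEARCH over a full SCAN, which is exactly the decision a virtual table would steer by returning a lower estimated cost from xBestIndex(). Table and column names follow the example query in the thread.

```python
# Sketch: observe the planner's index-vs-scan decision on native tables.
# For virtual tables, xBestIndex() communicates the equivalent costs.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE tblA (id INTEGER PRIMARY KEY, X INTEGER);
    CREATE TABLE tblB (id INTEGER PRIMARY KEY, Y INTEGER, Z INTEGER);
    CREATE INDEX idx_a_x ON tblA(X);  -- stands in for the "native index" on tblA(X)
""")

# With the index present, the X = ? constraint is satisfied by a SEARCH
# using idx_a_x instead of a sequential scan of tblA.
plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM tblA WHERE X = 7"
).fetchall()
for row in plan:
    print(row[-1])  # the plan detail mentions idx_a_x
```

Dropping the index and re-running the EXPLAIN QUERY PLAN shows the fallback to a full scan, i.e. the "cost without constraint" case described above.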
Re: [Qgis-developer] QEP/RFC sqlite virtual tables
Hi Hugo

On Tue, Nov 4, 2014 at 8:36 PM, Hugo Mercier hugo.merc...@oslandia.com wrote:
> * About indexes on virtual tables, contrary to what I wrote previously, the xBestIndex() method of virtual tables should be enough to orient the planner: an estimated cost and an estimated number of rows can be returned for each part of the WHERE clause. So there should be no need to copy native indexes. But the provider interface should be extended in order to provide such statistics.

I am trying to figure out how things would work... How would you decide whether to copy the native index of a virtual table? (Always?) E.g. a pkey, at least in Postgres, is also an index - shall we always copy it? When would you make copies of native indexes? (When constructing the provider? On every getFeatures() call?) How would you keep the index up to date? (The table may change outside of QGIS.) What I want to understand is how SQL SELECT statements would execute - like the one with the join filter you need:

SELECT * FROM tblA LEFT JOIN tblB ON tblA.X = tblB.Y WHERE tblB.Z = 42

> We are willing to develop this as a plugin if it can be included in QGIS as a C++ plugin. Is there any objection to this?

Personally I do not see much difference between having some functionality in 'core' and having some functionality in a C++ plugin (included in the QGIS tree) - in both cases it requires us to maintain such code once it is added... Why not develop a prototype that does not require QGIS code - I guess the only QGIS-specific part is the virtual table implementation. Such a prototype could eventually evolve into a standalone library providing an SQL parser / execution engine with a nice interface...

Cheers
Martin
Re: [Qgis-developer] QEP/RFC sqlite virtual tables
Hi,

Some additional information, after some more digging into SQLite code / documentation / mailing lists:

* The parsing part of SQLite is private and cannot be invoked by an external program. Moreover, the parse phase directly builds a program, not an abstract syntax tree. However, a formal grammar is available (http://www.sqlite.org/syntaxdiagrams.html) and the SQLite license is very permissive. So reusing parts of the SQLite parser seems not that hard and would allow parsing a subset of SQL (only SELECT, I guess).

* About indexes on virtual tables: contrary to what I wrote previously, the xBestIndex() method of virtual tables should be enough to orient the planner; an estimated cost and an estimated number of rows can be returned for each part of the WHERE clause. So there should be no need to copy native indexes. But the provider interface should be extended in order to provide such statistics.

Now for the more concrete part: we may have funding for this development, but a decision has to be taken before the funding vanishes. So, in order to advance on this topic, what we can propose is to develop a fully functional prototype, as a transition before it is added to core. It will implement a new data provider, with these user-oriented features:

* an integration into DBManager, in order to be able to use SQL on any layer
* the ability to use a WHERE filter on layers that have a join.

For the code part, a SELECT parser will be proposed, based on SQLite if possible, and performance tests will be developed for different use cases, allowing us to test the planner's performance as well as the data copying part.

We are willing to develop this as a plugin if it can be included in QGIS as a C++ plugin. Is there any objection to this?

On 30/10/2014 10:32, Hugo Mercier wrote:
> Hi Martin,
>
> Thanks for your review.
>
> On 29/10/2014 17:34, Martin Dobias wrote:
>> Hi Hugo!
>>
>> On Tue, Oct 28, 2014 at 6:15 PM, Hugo Mercier hugo.merc...@oslandia.com wrote:
>>> are there other opinions? Other arguments for one or the other side?
>>
>> I have to say that initially I was very excited about the idea... and after thinking about the details I am getting less excited, as there are probably lots of gotchas. Below is a list of questions in the order of reading the proposal.
>>
>> ** Use cases
>> - item 1 - in what sense would the layer be dynamic? ("create a dynamic point layer based on X,Y(,Z) coordinates")
>
> If point coordinates in the underlying layer change, then their geometries can be updated during the next fetch. Like, for example, the creation of a point layer from a CSV file with the "watch file" option.
>
>> ** Indexes
>> - how will indexes be requested by underlying providers of primary sources? Will you add support for providers to return indices associated with a given primary source? How can sqlite make use of such information? Are you going to replicate the same indices for the virtual tables?
>
> Yes, we will need something like this in order to properly use indexes of underlying layers, by populating the sqlite_stat* tables. And yes, I realize that may involve replication of indexes. For the case where more than one index is available on a table, the xBestIndex method of the virtual table mechanism is made for that: an estimated cost can be returned for each constraint.
>
>> - the need to use spatial indices explicitly (with such weird syntax) is not good - this should be done by the DBMS, not by the user. In practice nobody will use them and spatial queries will be slow
>
> Yes, it is the Spatialite spatial index syntax.
>
>> ** Serialization
>> - I am lost. Are we going to materialize those sqlite views in memory or on disk? I thought they were meant to be just views wrapping QGIS layers
>
> Yes, they are. They can be part of a project file. The second paragraph is misleading.
>
>> ** Implementation details
>> - regarding the QgsVectorLayer constructor: to me it does not look like a case special enough to require a new parameter
>> - what does it mean that the provider would be "based on spatialite"? A subclass of the spatialite provider - or a copy of it - or something else?
>
> Not decided yet. But it is probably too early in the discussion to go into such details.
>
>> - are the insert/update/delete triggers going to be provided by the user or generated automatically? (I believe it should be automatic)
>
> As automatic as possible. But in the general case it is not possible: some columns cannot be updated (computed during the query), or there may be more than one way to update a join.
>
>> - how do we find out whether the views from virtual layers are actually editable?
>
> If an update trigger is defined.
>
>> - how do we determine whether the view has a geometry - and in case of multiple geometries, which one to choose? (geometry columns are just ordinary blobs, right?)
>> - how do we determine the primary key for a view? (for feature IDs)
>
> Good points. We will need to pass this information to the constructor, as it is now with a postgres provider.
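[Editor's note] The "editable if an update trigger is defined" answer above refers to SQLite's INSTEAD OF triggers on views. A minimal sketch in plain SQLite (the table names pts/attrs are made up for illustration): a view over a join is read-only until such a trigger routes the update to the right base table.

```python
# Sketch: an INSTEAD OF UPDATE trigger makes a joined view "editable",
# which is the mechanism proposed for detecting virtual-layer editability.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE pts (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE attrs (pt_id INTEGER, population INTEGER);
    INSERT INTO pts VALUES (1, 'Paris');
    INSERT INTO attrs VALUES (1, 2200000);

    CREATE VIEW v AS
        SELECT p.id, p.name, a.population
        FROM pts p JOIN attrs a ON a.pt_id = p.id;

    -- Without this trigger, UPDATE on v fails: views are read-only.
    CREATE TRIGGER v_upd INSTEAD OF UPDATE OF population ON v
    BEGIN
        UPDATE attrs SET population = NEW.population WHERE pt_id = OLD.id;
    END;
""")

con.execute("UPDATE v SET population = 2240000 WHERE id = 1")
print(con.execute("SELECT population FROM attrs").fetchone()[0])  # 2240000
```

As noted in the thread, generating such triggers automatically is only possible for columns that map unambiguously back to one base table.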
Re: [Qgis-developer] QEP/RFC sqlite virtual tables
Hi Martin,

Thanks for your review.

On 29/10/2014 17:34, Martin Dobias wrote:
> Hi Hugo!
>
> On Tue, Oct 28, 2014 at 6:15 PM, Hugo Mercier hugo.merc...@oslandia.com wrote:
>> are there other opinions? Other arguments for one or the other side?
>
> I have to say that initially I was very excited about the idea... and after thinking about the details I am getting less excited, as there are probably lots of gotchas. Below is a list of questions in the order of reading the proposal.
>
> ** Use cases
> - item 1 - in what sense would the layer be dynamic? ("create a dynamic point layer based on X,Y(,Z) coordinates")

If point coordinates in the underlying layer change, then their geometries can be updated during the next fetch. Like, for example, the creation of a point layer from a CSV file with the "watch file" option.

> ** Indexes
> - how will indexes be requested by underlying providers of primary sources? Will you add support for providers to return indices associated with a given primary source? How can sqlite make use of such information? Are you going to replicate the same indices for the virtual tables?

Yes, we will need something like this in order to properly use indexes of underlying layers, by populating the sqlite_stat* tables. And yes, I realize that may involve replication of indexes. For the case where more than one index is available on a table, the xBestIndex method of the virtual table mechanism is made for that: an estimated cost can be returned for each constraint.

> - the need to use spatial indices explicitly (with such weird syntax) is not good - this should be done by the DBMS, not by the user. In practice nobody will use them and spatial queries will be slow

Yes, it is the Spatialite spatial index syntax.

> ** Serialization
> - I am lost. Are we going to materialize those sqlite views in memory or on disk? I thought they were meant to be just views wrapping QGIS layers

Yes, they are. They can be part of a project file. The second paragraph is misleading.

> ** Implementation details
> - regarding the QgsVectorLayer constructor: to me it does not look like a case special enough to require a new parameter
> - what does it mean that the provider would be "based on spatialite"? A subclass of the spatialite provider - or a copy of it - or something else?

Not decided yet. But it is probably too early in the discussion to go into such details.

> - are the insert/update/delete triggers going to be provided by the user or generated automatically? (I believe it should be automatic)

As automatic as possible. But in the general case it is not possible: some columns cannot be updated (computed during the query), or there may be more than one way to update a join.

> - how do we find out whether the views from virtual layers are actually editable?

If an update trigger is defined.

> - how do we determine whether the view has a geometry - and in case of multiple geometries, which one to choose? (geometry columns are just ordinary blobs, right?)
> - how do we determine the primary key for a view? (for feature IDs)

Good points. We will need to pass this information to the constructor, as it is now with a postgres provider.

> ** Python bindings
> - how would I create a virtual layer from Python? I would hope to do something like iface.addVectorLayer("SELECT * FROM layerX", "virtual") - would that be possible?

This kind of syntax (using loaded layer names instead of their explicit definition) is clean but would require:

* that layers are already loaded before we can use them
* then saving the loading order of layers in the project file

The document refers to the use of explicit layer URLs. I don't have a strong opinion on this choice.

> ** User interface
> - what is the "referenced layers" box for - is it just to inform the user which layers he used? Or does he need to explicitly register those primary layers somewhere?

See previous answer.

> ** Performance
> - how will sqlite plan execution of queries?
>   - inner join of two virtual tables based on attributes?
>   - spatial joins?
> - does sqlite have access to any statistics of virtual tables and indices? These are quite important for reasonable query execution plans.

See the answer above: sqlite_stat* tables, xBestIndex.

> - During normal operation when iterating over a vector layer there is a conversion: backend -> QgsFeature. As far as I understand, the virtual tables add two more conversions during execution: QgsFeature -> sqlite -> QgsFeature (first in the virtual table implementation, second in the new provider implementation). Is that correct? Have you tried to measure the additional overhead?

Hmmm, correct. I do not have any measurements yet...

> In summary: I have the impression that virtual tables are just a simple way to provide access to external data to sqlite, not an invitation to use the sqlite engine with foreign data. The provided API simply does not look mature enough. I would not be too optimistic about the openness of the sqlite project - the developer list is not publicly available and I haven't found a link to the repository either (except for the commit log) - but I may be wrong.
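[Editor's note] The sqlite_stat* tables mentioned above are what SQLite's native planner reads for statistics; for native tables they are filled by ANALYZE. A minimal sketch of what they contain (the proposal is that virtual tables would instead report equivalent numbers via xBestIndex()):

```python
# Sketch: ANALYZE populates sqlite_stat1 with per-index statistics
# (row count and average rows per distinct key) used by the planner.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (a INTEGER)")
con.execute("CREATE INDEX idx_t_a ON t(a)")
# 1000 rows, 10 distinct values of a
con.executemany("INSERT INTO t VALUES (?)", [(i % 10,) for i in range(1000)])

con.execute("ANALYZE")  # gathers the statistics
stats = con.execute(
    "SELECT tbl, idx, stat FROM sqlite_stat1 WHERE tbl = 't'"
).fetchone()
print(stats)  # e.g. ('t', 'idx_t_a', '1000 100')
```

The stat string starts with the total row count; the following number estimates how many rows share one index key, which is exactly the kind of selectivity information a provider interface would have to expose.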
Re: [Qgis-developer] QEP/RFC sqlite virtual tables
On 28.10.2014 16:30, Hugo Mercier wrote:
> On 28/10/2014 13:21, Matthias Kuhn wrote:
>> Hi Hugo
>
> The optimization plan could be: don't try to optimize in the general case (premature optimization...), only optimize specific, well-identified cases. For now the only simple case I can see is when a join is done on tables from the same database (and the user doesn't know about, or can't do, a join on the remote database)... or a developer wants to develop a provider-independent plugin/core functionality that has good cross-table performance. Do you have something in mind?

Spontaneously: creating reports based on aggregated statistics, but the field is wide open.

> Indeed, I don't know the sqlite dev community well. I will try to learn a bit more about the exposure of internal SQLite data structures. In the worst case we could decide to fork, or include parts of the code; the license is very permissive. Btw, there is a spatialite / sqlite source distribution in src/spatialite (probably only used when WITH_INTERNAL_SPATIALITE is enabled)

I think in src/core/spatialite there is only the spatialite and not the sqlite source code. I normally don't compile with internal spatialite (it's an old version? and I like to stay close to upstream and avoid duplicated libs). It also opens the problem of name collisions for linking, hard-to-handle interdependencies between external and internal sqlite versions, and it will require us to manually keep the code in sync when new versions are released. It would probably be hard (but surely not impossible) and it will probably need ongoing effort. Therefore it would be very welcome if the sqlite community is open to changes.

> Ok, but this performance issue is not related to the backend choice (SQLite or our own), right?

The performance issue itself, no. The possible optimization we are discussing now, yes :-)

Matthias

--
Help getting QGIS to the next level of quality before November 15!
http://blog.vitu.ch/10102014-1046/crowdfunding-initiative-automated-testing
Re: [Qgis-developer] QEP/RFC sqlite virtual tables
Hi Régis,

On 28.10.2014 17:35, Régis Haubourg wrote:
> Hi all, this is a very high-level topic for me. I will just point out some user/funder needs, and maybe try to describe other strategies that exist in GIS software. As a user coming from both the ESRI and MapInfo worlds:
>
> - The main (and last?) real advantage of MI was its native pseudo-SQL capabilities (with many limitations). Users really often now have a centralized spatial DB, where the administrator can do many things (almost everything, in fact), but the classical user has only two possibilities for relational DB treatments: algorithms, or duplicating data in sqlite and learning spatial SQL (unless their DBA grants them access to Postgres...). Power users are very happy; ex-MI users are not, in QGIS.

I've also heard that from others. I think it's a very valid requirement and pretty much what QEP/RFC 5 by Hugo proposes.

> - ArcGIS has no aggregate capabilities over its relations. Joins are more developed and allow "summarize" algorithms. To go further, it requires using an external DB system, and that's a pain (they all have limitations, in terms of cost or learning curve). PostGIS/sqlite is far ahead in terms of accessibility. Having virtual sqlite tables would allow some MI-like behaviour. Users here would appreciate that a lot (I will still be using postgres underneath for performance).

And that's where my main question marks are: will this implementation be able to make use of postgres capabilities? In particular, a postgres database has a higher constant per-request cost compared to local data (network round trip vs file access) but makes up for it through efficient filtering/aggregation. And if there are problems, how many possibilities do we have to improve the situation?

> For more simple use cases, some only want to improve joins by being able to join and output aggregate values when multiple tuples join, for simple mapping purposes. I think the UI needs both things: an SQL dialog to create queries - QSpatialite-like - on every table, and some aggregate capabilities over joins. What we must avoid is having two different implementations, with different limitations. Only power users knowing what is really happening underneath will know which function to use, which is bad in UX terms. A comparison is the OGR CSV driver versus the CSV plugin... The user has to know that the two tools are different and behave differently, with different providers. Nothing in the GUI lets you know that except a try-and-fail approach.

The problem is that we have one proposal which can be available soon, with lots of functionality and a not-too-hard implementation. But we/I lack knowledge of whether it performs well in every scenario and whether it offers the possibility to optimize.

> BTW, I have the feeling you don't disagree at all, but that we are digging into one of the harder features of a GIS tool. IMHO, that really deserves discussion and proofs of concept. Any other opinions from devs?

I also don't think it's disagreement; it's an evaluation of the risks/chances involved in either of the two.

Best
Matthias
Re: [Qgis-developer] QEP/RFC sqlite virtual tables
Hello,

On Wednesday, 29 October 2014 at 08:34:11, Matthias Kuhn wrote:
> [...]
> limitations. Only power users knowing what is really happening underneath will know which function to use, which is bad in UX terms. A comparison is the OGR CSV driver versus the CSV plugin... The user has to know that the two tools are different and behave differently, with different providers. Nothing in the GUI lets you know that except a try-and-fail approach.
>
> The problem is that we have one proposal which can be available soon, with lots of functionality and a not-too-hard implementation. But we/I lack knowledge of whether it performs well in every scenario and whether it offers the possibility to optimize.

The thing is, we cannot forecast everything, and without a working prototype we will probably never be able to determine the behaviour for every scenario, as we probably cannot imagine every scenario at all. Premature optimization is the root of all evil. At some point we need some working code to be able to start iterating over something real, rather than trying to imagine a perfect plan and seeing nothing come of it in the end. QEPs are there to make sure the global orientation and design are good and coherent with the rest of the project. They cannot be fully exhaustive, and there will probably be other QEPs regarding this feature set to enhance and fix the first implementation. We could even go back if this approach proves to fail. Failure is progress too.

Vincent

> BTW, I have the feeling you don't disagree at all, but that we are digging into one of the harder features of a GIS tool. IMHO, that really deserves discussion and proofs of concept. Any other opinions from devs?
>
> I also don't think it's disagreement; it's an evaluation of the risks/chances involved in either of the two.
>
> Best
> Matthias
Re: [Qgis-developer] QEP/RFC sqlite virtual tables
Hi Vincent

On 29.10.2014 10:51, Vincent Picavet wrote:
> On Wednesday, 29 October 2014 at 08:34:11, Matthias Kuhn wrote:
>> [...]
>> limitations. Only power users knowing what is really happening underneath will know which function to use, which is bad in UX terms. A comparison is the OGR CSV driver versus the CSV plugin... The user has to know that the two tools are different and behave differently, with different providers. Nothing in the GUI lets you know that except a try-and-fail approach.
>>
>> The problem is that we have one proposal which can be available soon, with lots of functionality and a not-too-hard implementation. But we/I lack knowledge of whether it performs well in every scenario and whether it offers the possibility to optimize.
>
> The thing is, we cannot forecast everything, and without a working prototype we will probably never be able to determine the behaviour for every scenario, as we probably cannot imagine every scenario at all. Premature optimization is the root of all evil.

It's not premature optimization. It's evaluation of a likely scenario (at least nobody has disagreed with a likely performance problem so far) and early prevention of vendor lock-in by looking ahead.

> At some point we need some working code to be able to start iterating over something real, rather than trying to imagine a perfect plan and seeing nothing come of it in the end. QEPs are there to make sure the global orientation and design are good and coherent with the rest of the project. They cannot be fully exhaustive, and there will probably be other QEPs regarding this feature set to enhance and fix the first implementation. We could even go back if this approach proves to fail. Failure is progress too.

There would also be the possibility to create this feature as a plugin instead of implementing it in core. I think this has been discussed before, if I remember right. If it's in core, it's very unlikely that we will go back. And going back would certainly not make life easier for power users.

I don't want to promote the option of implementing it as a plugin; I am just listing it for the sake of completeness.

I think the simplest answer to my questions would probably be a statement from the sqlite virtual table developers stating that they are open to collaboration. That would literally vaporize any concerns I had. It should not be hard to ask.

But again: I do like the general idea and I see the potential of this! Please do not treat this message as a fundamental disagreement with the QEP.

Matthias
Re: [Qgis-developer] QEP/RFC sqlite virtual tables
Hi Hugo!

On Tue, Oct 28, 2014 at 6:15 PM, Hugo Mercier hugo.merc...@oslandia.com wrote:
> are there other opinions? Other arguments for one or the other side?

I have to say that initially I was very excited about the idea... and after thinking about the details I am getting less excited, as there are probably lots of gotchas. Below is a list of questions in the order of reading the proposal.

** Use cases
- item 1 - in what sense would the layer be dynamic?

** Indexes
- how will indexes be requested by underlying providers of primary sources? Will you add support for providers to return indices associated with a given primary source? How can sqlite make use of such information? Are you going to replicate the same indices for the virtual tables?
- the need to use spatial indices explicitly (with such weird syntax) is not good - this should be done by the DBMS, not by the user. In practice nobody will use them and spatial queries will be slow

** Serialization
- I am lost. Are we going to materialize those sqlite views in memory or on disk? I thought they were meant to be just views wrapping QGIS layers

** Implementation details
- regarding the QgsVectorLayer constructor: to me it does not look like a case special enough to require a new parameter
- what does it mean that the provider would be "based on spatialite"? A subclass of the spatialite provider - or a copy of it - or something else?
- are the insert/update/delete triggers going to be provided by the user or generated automatically? (I believe it should be automatic)
- how do we find out whether the views from virtual layers are actually editable?
- how do we determine whether the view has a geometry - and in case of multiple geometries, which one to choose? (geometry columns are just ordinary blobs, right?)
- how do we determine the primary key for a view? (for feature IDs)

** Python bindings
- how would I create a virtual layer from Python? I would hope to do something like iface.addVectorLayer("SELECT * FROM layerX", "virtual") - would that be possible?

** User interface
- what is the "referenced layers" box for - is it just to inform the user which layers he used? Or does he need to explicitly register those primary layers somewhere?

** Performance
- how will sqlite plan execution of queries?
  - inner join of two virtual tables based on attributes?
  - spatial joins?
- does sqlite have access to any statistics of virtual tables and indices? These are quite important for reasonable query execution plans.
- During normal operation when iterating over a vector layer there is a conversion: backend -> QgsFeature. As far as I understand, the virtual tables add two more conversions during execution: QgsFeature -> sqlite -> QgsFeature (first in the virtual table implementation, second in the new provider implementation). Is that correct? Have you tried to measure the additional overhead?

In summary: I have the impression that virtual tables are just a simple way to provide access to external data to sqlite, not an invitation to use the sqlite engine with foreign data. The provided API simply does not look mature enough. I would not be too optimistic about the openness of the sqlite project - the developer list is not publicly available and I haven't found a link to the repository either (except for the commit log) - but I may be wrong.

Aren't there any other embedded SQL engines that we could use? Maybe we should have a simple SQL SELECT statement parser that would parse the queries and then decide what to do with them:
- execute on the backend (if all tables are from the same DB)
- execute with existing QGIS functionality (if we have an efficient algorithm for that)
- execute with the sqlite engine (with materialized views)

Cheers
Martin
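[Editor's note] Martin's three-way dispatch idea can be sketched as follows. Everything here is hypothetical for illustration - `parse_tables()`, `choose_target()`, and `ExecutionTarget` are made-up names, and the toy regex stands in for the real SELECT parser the thread discusses:

```python
# Hypothetical sketch of a three-way dispatcher: push to backend, use a
# QGIS algorithm, or fall back to sqlite. Not QGIS API.
import re
from enum import Enum

class ExecutionTarget(Enum):
    BACKEND = "push the whole query to the source database"
    QGIS = "use an existing efficient QGIS algorithm"
    SQLITE = "fall back to the sqlite engine over virtual tables"

def parse_tables(select_sql):
    # Toy stand-in for a real SELECT parser: grab identifiers after
    # FROM/JOIN. A real implementation would use a proper grammar.
    return re.findall(r"(?:FROM|JOIN)\s+([A-Za-z_][A-Za-z0-9_]*)",
                      select_sql, re.IGNORECASE)

def choose_target(select_sql, layer_db):
    # layer_db maps table name -> connection identifier of its backend
    # (None for local files without an SQL backend).
    dbs = {layer_db.get(t) for t in parse_tables(select_sql)}
    if len(dbs) == 1 and None not in dbs:
        return ExecutionTarget.BACKEND   # all tables share one database
    return ExecutionTarget.SQLITE        # heterogeneous sources

layers = {"tblA": "postgis://hosta/db1", "tblB": "postgis://hosta/db1"}
sql = "SELECT * FROM tblA LEFT JOIN tblB ON tblA.X = tblB.Y"
print(choose_target(sql, layers).name)  # BACKEND
```

The QGIS branch is left unreached here; deciding when a built-in algorithm beats the sqlite fallback is exactly the open question of the thread.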
Re: [Qgis-developer] QEP/RFC sqlite virtual tables
Hi Hugo,

Sorry for my slow response, I was quite busy these last days.

I'll try to explain a bit more what I have in mind. For example, there is the new expression function getFeature(), which is an ad-hoc replacement for a join. I have been told that the performance for rendering is not very good; I assume this is caused by a lot of requests performed in the background. It would be great to be able to join this on the server side.

I would be very happy to see more of this introduced. For example, I could well imagine a new expression (syntax just made up, it can still be designed in a more appropriate way):

avg( c.population ) { LEFT JOIN cities c ON c.country_code = $thistable.code }

This should then be locally evaluated (for shapefiles et al.) but forwarded to the server for database services (postgres and the like).

Now, as we have different backends and their syntax does not always match (most do standard SQL, but there are extensions - think of ST_* in PostGIS, SDO_* in Oracle Spatial, etc. - and there is no guarantee that a data provider supports standard SQL), we need an abstraction layer and cannot just assume that every data provider will understand the same thing. I am planning to implement this abstraction layer for ordinary expressions in the near future. And I would be very happy to see a way to extend this with aggregate functions.

This will all require a fallback level (if the server does not support a given functionality, for shapefiles, to be sure that the result is the very same regardless of server function implementation details...). We already have this for ordinary expressions, as we can just use the QgsExpression implementation. I can only repeat that I will be more than happy to see this fallback level implemented by means of sqlite virtual tables. But in my humble opinion it would be very unfortunate if there were no way to optimize this tomorrow or the day after tomorrow.

That's why I am asking if the sqlite developers are open to collaboration, because if the will to collaborate is missing there, we will end up having to reimplement this functionality.

I totally agree with your statement "A general cross-database SQL optimizer would need lots of work (in QGIS and/or SQLite) and may come later (if really needed).", but would add that it will require either a re-implementation in QGIS, ending up in duplicated functionality (that's what Régis was worried about in the original mail), or will need the possibility to get deeper access to sqlite's internal data structures or callbacks to optimize where required. This will require cooperation from the SQLite developers' side, or forking sqlite.

Best regards,
Matthias

--
Please help taking QGIS to the next level of quality. Before November 15!
http://blog.vitu.ch/10102014-1046/crowdfunding-initiative-automated-testing
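[Editor's note] The made-up expression above corresponds to an aggregate over a join. A minimal sketch of the SQL it would translate to, run locally in SQLite as the "fallback level" described (the countries/cities schema and sample rows are assumptions for illustration):

```python
# Sketch: the local fallback for
#   avg( c.population ) { LEFT JOIN cities c ON c.country_code = $thistable.code }
# as plain SQL over SQLite.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE countries (code TEXT, name TEXT);
    CREATE TABLE cities (country_code TEXT, population INTEGER);
    INSERT INTO countries VALUES ('FR', 'France'), ('CH', 'Switzerland');
    INSERT INTO cities VALUES ('FR', 2200000), ('FR', 800000), ('CH', 400000);
""")

rows = con.execute("""
    SELECT t.code, avg(c.population)
    FROM countries t
    LEFT JOIN cities c ON c.country_code = t.code
    GROUP BY t.code
""").fetchall()
print(sorted(rows))  # [('CH', 400000.0), ('FR', 1500000.0)]
```

For a database backend the same statement would be shipped to the server instead of evaluated row by row, which is the optimization being argued for.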
Re: [Qgis-developer] QEP/RFC sqlite virtual tables
Hi Matthias, Le 28/10/2014 09:46, Matthias Kuhn a écrit : Hi Hugo, Sorry for my slow response, I was quite busy the last days. With the 2.6 coming I know this is not the best time for this kind of discussion :( So thanks for your time. I'll try to explain a bit more what I have in mind. For example there is the new expression function getFeature() which is an ad-hoc replacement for a join. I have been told that the performance for rendering is not very good, I assume this is caused by a lot of requests performed in the background. It would be great to be able to join this on the server side. I agree. This getFeature() is a symptom that we need more advanced relational-oriented functionalities. I would be very happy to see more of this introduced. For example I could well imagine a new Expression (Syntax just made up, that can still be designed in a more appropriate way) avg( c.population ) { LEFT JOIN cities c ON c.country_code = $thistable.code } This should then be locally evaluated (for shapefiles et al) but forwarded to the server for database services (postgres and the like). How would it be forwarded for database services ? If the two tables are from the same db, then ok, we are in the simple optimization case that I was talking about previously : more or less a translation between SQL dialects. But what if the current table is a local file (shapefile) and cities is a postgis table ? You would then need a magical optimization engine that will say you need to send the list of local country codes to a remote query like : SELECT avg(population) FROM cities WHERE country_code IN ( $local_country_code_array ) GROUP BY country_code But actually, the best optimization depends in the general case on the actual data stored in tables. If the bandwith required to send the IN part to PostGIS is too important, there may be cases where it is better to fetch everything remotely and sort and group by locally or they may be cases where another query would be better. 
This is why writing an optimal SQL query is hard, and why database planners do lots of work, rely on lots of heuristics and statistics, and are hard to implement.

Now, as we have different backends and their syntax does not always match (most do standard SQL, but there are extensions, think of ST_* in PostGIS, SDO_* in Oracle Spatial, etc., and there is no guarantee that a data provider supports standard SQL), we need an abstraction layer and cannot just assume that every data provider will understand the same thing.

Yes, exactly, and this was the whole point of relying on SQLite and Spatialite as an abstraction layer.

I am planning to implement this abstraction layer for ordinary expressions in the near future. And I would be very happy to see that there is a way to extend this with aggregate functions. This will all require a fallback level (if the server does not support a given functionality, for shapefiles, to be sure that the result is the very same regardless of server function implementation details...). We already have this for ordinary expressions, as we can just use the QgsExpression implementation.

I can only repeat that I will be more than happy to see this fallback level implemented by means of SQLite virtual tables. But in my humble opinion it would be very unfortunate if there were no way to optimize this tomorrow or the day after tomorrow. That's why I am asking if SQLite developers are open to collaborating, because if the will to collaborate is missing we will end up having to reimplement this functionality.
I totally agree with your statement "A general cross-database SQL optimizer would need lots of work (in QGIS and/or SQLite) and may come later (if really needed).", but would append that it will require either a re-implementation in QGIS, ending up in duplicated functionality (that's what Régis was worried about in the original mail), or the possibility to get deeper access to SQLite's internal data structures or callbacks to optimize where required. This will require cooperation from the SQLite developers' side, or forking SQLite.

I really don't think the difficulty is here. Even if we have access to the parsed SQL and even to the result of the SQLite planner, splitting the execution plan into optimal local and remote operations is *very hard* in the general case. And we would have to do it whatever abstraction layer we choose: SQLite or our own QgsExpression-based solution. And I really don't think it would be a good idea to do this in QGIS.

The optimization plan could be: don't try to optimize in the general case (premature optimization ...), only optimize specific well-identified cases. For now the only simple case I can see is when a join is done on tables from the same database (and the user doesn't know how, or can't, do the join on the remote database); then yes we would have to know the syntax of the original query and translate it into the desired SQL dialect.
Re: [Qgis-developer] QEP/RFC sqlite virtual tables
Hi Hugo

On 28.10.2014 12:15, Hugo Mercier wrote:

Hi Matthias, On 28/10/2014 09:46, Matthias Kuhn wrote: Hi Hugo, sorry for my slow response, I was quite busy these last days. With 2.6 coming I know this is not the best time for this kind of discussion :( So thanks for your time. I'll try to explain a bit more what I have in mind. For example there is the new expression function getFeature(), which is an ad-hoc replacement for a join. I have been told that the performance for rendering is not very good; I assume this is caused by a lot of requests performed in the background. It would be great to be able to join this on the server side. I agree. This getFeature() is a symptom that we need more advanced relational-oriented functionalities. I would be very happy to see more of this introduced. For example I could well imagine a new expression (syntax just made up, that can still be designed in a more appropriate way): avg( c.population ) { LEFT JOIN cities c ON c.country_code = $thistable.code }. This should then be locally evaluated (for shapefiles et al.) but forwarded to the server for database services (Postgres and the like). How would it be forwarded for database services? If the two tables are from the same DB, then OK, we are in the simple optimization case that I was talking about previously: more or less a translation between SQL dialects. But what if the current table is a local file (shapefile) and cities is a PostGIS table?

In this case I think the sqlite virtual table approach (or any other local filtering method) should be preferred. That would involve too much black magic. There may be some cases where we can mix things (e.g. two parts of a WHERE clause combined with AND, where only one of the two clauses has a native SQL translation, could be performed as a two-step filter, but that's really not the main point).

You would then need a magical optimization engine that knows it needs to send the list of local country codes to a remote query like: SELECT avg(population) FROM cities WHERE country_code IN ( $local_country_code_array ) GROUP BY country_code. But actually, the best optimization depends in the general case on the actual data stored in the tables. If the bandwidth required to send the IN part to PostGIS is too high, there may be cases where it is better to fetch everything remotely and sort and group by locally, or there may be cases where another query would be better. This is why writing an optimal SQL query is hard, and why database planners do lots of work, rely on lots of heuristics and statistics, and are hard to implement. Now, as we have different backends and their syntax does not always match (most do standard SQL, but there are extensions, think of ST_* in PostGIS, SDO_* in Oracle Spatial, etc., and there is no guarantee that a data provider supports standard SQL), we need an abstraction layer and cannot just assume that every data provider will understand the same thing. Yes, exactly, and this was the whole point of relying on SQLite and Spatialite as an abstraction layer.

I am planning to implement this abstraction layer for ordinary expressions in the near future. And I would be very happy to see that there is a way to extend this with aggregate functions. This will all require a fallback level (if the server does not support a given functionality, for shapefiles, to be sure that the result is the very same regardless of server function implementation details...). We already have this for ordinary expressions, as we can just use the QgsExpression implementation.
I can only repeat that I will be more than happy to see this fallback level implemented by means of SQLite virtual tables. But in my humble opinion it would be very unfortunate if there were no way to optimize this tomorrow or the day after tomorrow. That's why I am asking if SQLite developers are open to collaborating, because if the will to collaborate is missing we will end up having to reimplement this functionality. I totally agree with your statement "A general cross-database SQL optimizer would need lots of work (in QGIS and/or SQLite) and may come later (if really needed).", but would append that it will require either a re-implementation in QGIS, ending up in duplicated functionality (that's what Régis was worried about in the original mail), or the possibility to get deeper access to SQLite's internal data structures or callbacks to optimize where required. This will require cooperation from the SQLite developers' side, or forking SQLite.

I really don't think the difficulty is here. Even if we have access to the parsed SQL and even to the result of the SQLite planner, splitting the execution plan into optimal local and remote operations is *very hard* in the general case. And we would have to do it whatever abstraction layer we choose: SQLite or our own QgsExpression-based solution.
Re: [Qgis-developer] QEP/RFC sqlite virtual tables
On 28/10/2014 13:21, Matthias Kuhn wrote:

Hi Hugo

The optimization plan could be: don't try to optimize in the general case (premature optimization ...), only optimize specific well-identified cases. For now the only simple case I can see is when a join is done on tables from the same database (and the user doesn't know how, or can't, do the join on the remote database

... or a developer wants to develop a provider-independent plugin/core functionality that has good cross-table performance.

Do you have something in mind?

), then yes we would have to know the syntax of the original query and translate it into the desired SQL dialect. And yes, it would be better to have it from SQLite. It would require either working with the developers, patching it, or somehow including the parsing part of SQLite (we already ship sqlite / spatialite with QGIS, right?). But I really don't see it as a blocker.

I don't think we ship it, but I may be wrong? Please don't get me wrong, I didn't say it's a blocker. The original discussion started from Régis' statement "there's a QEP taking care of relational queries, is there a need to duplicate functionality?" and my response "I am not sure if sqlite virtual tables will be able to satisfy all our needs." I was just wondering about the limitations of this approach, and I am still very open to hear about any experience you have had with SQLite developers (I have none), because that is what counts when it comes to working with upstream as you proposed.

Indeed, I don't know the SQLite dev community well. I will try to find out a bit more about the exposure of internal SQLite data structures. In the worst case we could decide to fork, or include parts of the code; the license is very permissive. BTW, there is a spatialite / sqlite source distribution in src/spatialite (probably only used when WITH_INTERNAL_SPATIALITE is enabled).

I really think using SQLite as our database engine has good potential. It could extend the abilities and expressiveness of QGIS.
And it could also allow us to use *less* code in QGIS (seeing every provider as a virtual layer). And it could also pave the way for a native QGIS file format (this is another discussion, but somehow related). But whatever the implementation is, trying to be (automatically) as performant as a well-designed query on a well-designed DB is a waste of energy.

Yes, we cannot do that. But we can try to find a way that allows developers to develop, and users to create projects, independently of providers, while still providing good performance on a capable backend.

On a more pragmatic side, people are already interested and might pay for DB-oriented functionalities in QGIS (the very first need is to be able to filter joined tables). It does not have to be the only incentive for design choices, but this is a good opportunity. Then a decision has to be taken. So, if it is hard for you and me to agree :), are there other opinions? Other arguments for one or the other side?

I don't want to disagree, I just wanted to raise questions / understand the limitations. I see the demand for this functionality and I see the potential that SQLite virtual tables have to offer. I just wonder what the performance will be like in a scenario where there's a network (with latency for every request) involved. And what it would require to overcome this issue (if there is any).

Ok, but this performance issue is not related to the backend choice (SQLite or our own), right?

___ Qgis-developer mailing list Qgis-developer@lists.osgeo.org http://lists.osgeo.org/mailman/listinfo/qgis-developer
Re: [Qgis-developer] QEP/RFC sqlite virtual tables
Hi all, this is a very high-level topic for me. I will just point out some user/funder needs, and maybe try to describe other strategies that exist in GIS software. As a user coming from both the ESRI and MapInfo worlds:

- The main (and last?) real advantage of MapInfo was its native pseudo-SQL capabilities (with many limitations). Users now really often have a centralized spatial DB, where the administrator can do many things (almost everything in fact), but the classical user has only two possibilities for relational DB operations: algorithms, or duplicating data in sqlite and learning spatial SQL (unless their DBA grants them access to Postgres...). Power users are very happy; ex-MapInfo users are not in QGIS.

- ArcGIS has no aggregate capabilities over its relations. Joins are more developed and allow summarize algorithms. To go further, it requires using an external DB system, and that's a pain (they all have limitations, in terms of cost or learning curve...). PostGIS/sqlite is far away in terms of accessibility.

Having virtual sqlite tables would allow some MapInfo-like behaviour. Users here would appreciate that a lot (I will still be using Postgres underneath for performance). For simpler use cases, some only want to improve joins by being able to join and output aggregate values when multiple tuples join, for simple mapping purposes. I think the UI needs both things: a SQL dialog to create queries (Qspatialite-like) on every table, and some aggregate capabilities over joins. What we must avoid is having two different implementations with different limitations. Only power users knowing what is really happening underneath will know which function to use, which is bad in UX terms. A comparison is the OGR CSV driver versus the CSV plugin... The user has to know that the two tools are different and behave differently, with different providers. Nothing in the GUI lets you know that, except a try-fail approach. BTW, I have the feeling you don't disagree at all, but that we are digging into one of the hardest features of a GIS tool.
IMHO, that really deserves discussion and proofs of concept. Any other opinions among devs? Cheers, Régis
[Qgis-developer] QEP/RFC sqlite virtual tables
Moving this to a new thread (from "1:many relation"...)

On 22.10.2014 10:38, Hugo Mercier wrote:

Well, QgsExpression has been designed to mimic the SQL WHERE clause, so yes I agree. My point is that if you want to extend expressions to support joins, aggregates, window functions, etc., you end up recreating SQL, but probably a less powerful one.

You can call it mimicking; you can also call it implementing its own SQL (subset) dialect. Yes, you end up recreating SQL. How powerful it will be is hard to say right now, right?

The difference I can see between expressions and SQL is that functions used in expressions may have no direct equivalent SQLite-side. But that can be resolved by using user-defined functions in SQLite (that would just call code from QgsExpression for instance), so that the SQL dialect would be seen as an extension of the current QgsExpression, not something different. :)

I don't see anything that prevents virtual layer queries from being optimized server-side. It has to be taken into account during the implementation of virtual layers, of course, but it is hard to exhibit the precise way it will work now without having an almost complete implementation.

I was referring to this discussion here: https://github.com/qgis/QGIS-Enhancement-Proposals/pull/5#issuecomment-58148788 Sorry I did not respond again, but I don't think a complete optimization is possible without having deep access to virtual table functionality in sqlite (or having to parse the same SQL on our end and replace parts of it, which basically comes close to writing our own engine again). But let's discuss this on the QEP PR instead of inside this ML thread.

Just to give you my feedback based on my implementation of the SQLite SQL dialect in OGR ( https://github.com/OSGeo/gdal/blob/trunk/gdal/ogr/ogrsf_frmts/sqlite/ogrsqlitevirtualogr.cpp ), I also doubt that you could easily inform the server that a join should be done.
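The user-defined-function route mentioned above can be sketched with Python's stdlib sqlite3 module (QGIS would register C++ callbacks through the SQLite C API instead, but the mechanism is the same): a function with QgsExpression semantics is registered on the connection, so SQL queries can call it as if it were native. The `clamp` example is modeled on the QgsExpression function of the same name; the Python implementation here is a stand-in for a call into the QgsExpression engine.

```python
import sqlite3

def qgis_clamp(lo, value, hi):
    # Same semantics as the QgsExpression function clamp(minimum, input, maximum);
    # in QGIS this callback would delegate to the QgsExpression engine.
    return max(lo, min(value, hi))

conn = sqlite3.connect(":memory:")
# Register the function so any SQL run through this connection can call it;
# this is how the SQL dialect becomes an extension of the expression language.
conn.create_function("clamp", 3, qgis_clamp)
result = conn.execute("SELECT clamp(0, 15, 10)").fetchone()[0]
# result is 10: the value 15 clamped into the range [0, 10]
```

The same registration works for functions used in WHERE clauses, so rows can be filtered on expression-language predicates that SQLite itself does not know.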
AFAIR I could forward simple filtering to the server/backend (things like column_A >= 5 AND column_A <= 10 AND column_B != 10) by implementing the xFilter virtual method, but virtual tables are not informed of joins happening (the sqlite virtual table API hardly exposes SQL at all). So that would indeed require analyzing the SQL on the QGIS side. AFAIR (I've just verified), joins like t1.col1 = t2.col2 are seen by the virtual table implementation as successive filters t1.col1 = val1, t1.col1 = val2, ..., where val1, val2 are values from t2.col2. And regarding xFilter, even filters with OR etc. are hidden from the virtual table implementation and evaluated only by SQLite.

Thanks Even for your feedback. The idea behind virtual layers is that SQLite may be exposed to advanced users if they want to use a complex query on some data sources. It is very interesting to use SQLite to replace existing features from a code point of view, because it simplifies the implementation. But exposing a SQL query for already existing features such as joins may not be desirable for standard users. So for joins, since the information comes from QGIS (I want to join tables t1 and t2 on this column predicate), the SQL query can be deduced, and some other metadata is also known, like the fact that t1 and t2 may be from the very same database. In that case the virtual layer may be created with some metadata saying that this is in fact a join that must be optimized. And then the virtual layer provider can ask the underlying provider (say PostGIS) if it can process the whole join query. I am not only referring to the current relations implementation here. I don't think aggregate functions necessarily need to be based on creating a relation definition in the project file (just like in a database you don't need a foreign key constraint to create a join).
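Even's observation (the virtual table sees the join only as successive equality filters) mirrors how SQLite executes joins in general. The stdlib sqlite3 module cannot create virtual tables, but EXPLAIN QUERY PLAN on ordinary tables shows the same nested-loop shape: scan one table, then search the other once per row, with the join value bound as a filter each time. The table names follow the t1/t2 example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (col1 INTEGER);
    CREATE TABLE t2 (col2 INTEGER PRIMARY KEY);
""")

plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT * FROM t1 JOIN t2 ON t1.col1 = t2.col2"
).fetchall()
for row in plan:
    # Typically one SCAN of t1 and one SEARCH of t2 (rowid=?):
    # an equality lookup repeated for every t1 row, which is exactly
    # what a virtual table implementation of t2 would observe via xFilter.
    print(row[-1])
```

The exact wording of the plan rows varies between SQLite versions, but the SCAN-plus-SEARCH structure is stable, and the `?` placeholder in the SEARCH line is where each t1.col1 value gets bound, i.e. the "successive filters" Even describes.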
But if I want to create a new layer or an ad-hoc iterator that joins data or makes use of other advanced relational database features, it would be nice if QGIS could forward such requests to the database. Say I want to create a symbology based on the amount of waste within countries; QGIS should be able to execute this request on the database. In this scenario you would likely want to define this request inside an expression (e.g. a virtual column). So basically, I think we need functionality that is able to work as a database abstraction layer: forward wherever possible to the DB, use a (local) fallback implementation where not.

I am not opposed to using SQLite for this. There is no reason for us to reinvent the wheel; the NIH syndrome should not be our driver. But I see some limitations and I would rather prefer to have these properly addressed. And I will be more than happy if you tell me that you convinced the SQLite virtual table developers that they should extend their API to expose the parsed SQL structure and allow tweaking it. Then we have a fallback implementation which is well-tested, but we still have the possibility to fine-tune requests for big databases.
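The "forward wherever possible to the DB, fall back locally" layer described above could be dispatched roughly like this. All class and function names here are hypothetical, invented for the sketch; they do not correspond to any existing QGIS API.

```python
# All names are hypothetical; this only illustrates the dispatch idea.
class Provider:
    """A data provider that may or may not evaluate SQL natively."""
    def __init__(self, supports_sql, executor=None):
        self.supports_sql = supports_sql
        self.executor = executor  # callable that runs SQL on the backend

def evaluate(provider, sql, local_fallback):
    # Forward to the backend when it can run the query natively
    # (e.g. PostGIS); otherwise evaluate with the local engine
    # (SQLite virtual tables / QgsExpression in the email's terms).
    if provider.supports_sql:
        return provider.executor(sql)
    return local_fallback(sql)

# Usage sketch: a capable backend forwards, a shapefile falls back.
pg = Provider(True, executor=lambda sql: "ran on server: " + sql)
shp = Provider(False)
server_result = evaluate(pg, "SELECT 1", lambda s: "ran locally")
local_result = evaluate(shp, "SELECT 1", lambda s: "ran locally")
```

The important property, as the email stresses, is that the result must be identical on both paths regardless of which engine evaluated the query.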
Re: [Qgis-developer] QEP/RFC sqlite virtual tables
On 22/10/2014 11:21, Matthias Kuhn wrote:

On 22.10.2014 10:38, Hugo Mercier wrote: Well, QgsExpression has been designed to mimic the SQL WHERE clause, so yes I agree. My point is that if you want to extend expressions to support joins, aggregates, window functions, etc., you end up recreating SQL, but probably a less powerful one.

You can call it mimicking; you can also call it implementing its own SQL (subset) dialect.

Sorry for my English, mimicking was not a negative term for me.

Yes, you end up recreating SQL. How powerful it will be is hard to say right now, right?

Sure. But if you want it to be as powerful as other SQL engines, lots of work is needed (parsing, planning, optimizing). Work that has already been done, I think.

So for joins, since the information comes from QGIS (I want to join tables t1 and t2 on this column predicate), the SQL query can be deduced, and some other metadata is also known, like the fact that t1 and t2 may be from the very same database. In that case the virtual layer may be created with some metadata saying that this is in fact a join that must be optimized. And then the virtual layer provider can ask the underlying provider (say PostGIS) if it can process the whole join query. I am not only referring to the current relations implementation here. I don't think aggregate functions necessarily need to be based on creating a relation definition in the project file (just like in a database you don't need a foreign key constraint to create a join).

Not sure I understand. You mean there would be no need to create a virtual layer just to execute an aggregate?

Exactly.

But if the virtual layer concept is complete enough, then every single layer can be seen as a special case of a (trivial) virtual layer, and to create an aggregate, you won't need to create a new virtual layer.
But if I want to create a new layer or an ad-hoc iterator that joins data or makes use of other advanced relational database features, it would be nice if QGIS could forward such requests to the database. Say I want to create a symbology based on the amount of waste within countries; QGIS should be able to execute this request on the database. In this scenario you would likely want to define this request inside an expression (e.g. a virtual column). So basically, I think we need functionality that is able to work as a database abstraction layer: forward wherever possible to the DB, use a (local) fallback implementation where not.

I agree, but can you be more specific about your example? How many tables are involved? Which fields? What kind of joins?

I am not opposed to using SQLite for this. There is no reason for us to reinvent the wheel; the NIH syndrome should not be our driver. But I see some limitations and I would rather prefer to have these properly addressed. And I will be more than happy if you tell me that you convinced the SQLite virtual table developers that they should extend their API to expose the parsed SQL structure and allow tweaking it. Then we have a fallback implementation which is well-tested, but we still have the possibility to fine-tune requests for big databases.

Splitting the original query into local and remote queries seems to me very hard in the general case. It would require not only the parsed SQL AST, but also some information about the execution plan. And the plan may depend on statistics about the data. And even when you know the plan, I think it is still hard to translate. Generally speaking, if the user tries to do a very complex query with lots of different data sources, it is hard to guarantee good performance, but I don't see that as a major problem. If you want something complex AND fast, then use SQLite or PostGIS.
So, I would vote for optimizations that can be done for simple, well-defined scenarios, like the join case given by the current join or relation feature of QGIS: no nested queries, no GROUP BY, no ORDER BY, no CTEs, etc. That particular case could be detected when the join is created, and optimizations could be applied, like creating a (temporary) PostGIS layer with sql=the join query. I may be missing something. Are there other simple cases where the optimization is well defined?

To put it differently, I don't see the lack of knowledge about the parsed SQL query as a blocker for the implementation of this virtual layer concept. A general cross-database SQL optimizer would need lots of work (in QGIS and/or SQLite) and may come later (if really needed). The other option would be to say we don't need everything SQLite provides, and we won't ever need it. So then we could define our own subset of SQL, and it would probably be easier to optimize. But I can't see why we would want to limit the features :) it is usually the opposite.
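The "simple, well-defined case" above (both layers in the same database, a plain join, no nesting or grouping) could be detected and rewritten along these lines. Every name and structure here is hypothetical, invented for the sketch; in QGIS the check would compare the layers' datasource URIs and the result would become the sql= of a temporary backend layer.

```python
def push_down_join(layer_a, layer_b, key_a, key_b):
    """If both layers share a datasource, return a single backend-side
    join query; otherwise return None so the join is evaluated locally.
    Layers are modeled as plain dicts for this sketch."""
    if layer_a["datasource"] != layer_b["datasource"]:
        return None  # not the simple case: fall back to local evaluation
    return ('SELECT * FROM "%s" a JOIN "%s" b ON a."%s" = b."%s"'
            % (layer_a["table"], layer_b["table"], key_a, key_b))

# Usage sketch with invented layers:
countries = {"datasource": "pg:host=db1", "table": "countries"}
cities = {"datasource": "pg:host=db1", "table": "cities"}
shapefile = {"datasource": "shp:/tmp/x.shp", "table": "x"}

same_db = push_down_join(countries, cities, "code", "country_code")
mixed = push_down_join(countries, shapefile, "code", "fid")  # None
```

Restricting the rewrite to this narrow case sidesteps the hard general problem of splitting a plan between local and remote engines, which is the whole point of the proposal above.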