Re: [sqlite] Order of columns in group by statement affects query performance
Do you have any experience with SQLite virtual tables? I guess not. There are 20 issues here: 1) The abstract problem of choosing an Index for optimizing GROUP BY 2) the SQLite implementation (which I was referring to) Ad 1) Any index that covers all the GROUP BY fields is a good index because it allows aggregates to be computed on the fly as opposed to in a temporary table. ORDER BY clause and multiple indices may complicate the matter Ad 2) SQLite attempts to handle virtual tables the same as native tables (which is one of the main reasons we chose SQLite). The VT interface does not allow publication of indexes nor creation of native indexes on virtual tables. The aOrderBy table of the interface implies an ordered list of fields, therefore SQLite would have to call xBestIndex n! times to discover the least costly index to use. It is not unreasonable to assume that in a well designed SQL Statement the GROUP BY clause will be backed up by the necessary index and an identical ORDER BY clause (at least unintentionally by the programmer virtue of laziness resulting in copy-paste of the field list). Thus the aOrderBy array being used for ORDER BY and GROUP BY in the VT interface. http://www.sqlite.org/vtab.html 2.3 The xBestIndex Method SQLite uses the xBestIndex method of a virtual table module to determine the best way to access the virtual table. The xBestIndex method has a prototype like this: int (*xBestIndex)(sqlite3_vtab *pVTab, sqlite3_index_info*); ... Before calling this method, the SQLite core initializes an instance of the sqlite3_index_info structure with information about the query that it is currently trying to process. This information derives mainly from the WHERE clause and **ORDER BY or GROUP BY** clauses of the query, but also from any ON or USING clauses if the query is a join. -Ursprüngliche Nachricht- Von: James K. Lowden [mailto:jklow...@schemamania.org] Gesendet: Donnerstag, 25. April 2013 16:34 An: sqlite-users@sqlite.org Betreff: Re: [sqlite] Order of columns in group by statement affects query performance On Thu, 25 Apr 2013 10:29:34 +0200 Hick Gunter h...@scigames.at wrote: AFAIK SQLite treats GROUP BY the same way as ORDER BY (taken from hints in the virtual table description). That might be so, in some limited sense. It's obviously false in general because they mean different things and have different effects. If you have an index that covers the GROUP BY clause in field order, then aggregate functions need store only the current value; if not, then you need an ephemeral table to hold the aggregate values. Nonsense. The query parser sees GROUP BY A,B. The optimizer sees an index ordered B,A. By permuting the order of the columns in the GROUP BY clause, it finds a match for the index and uses it. Yes, the problem is O(n^2), where n is the number of columns in the GROUP BY, but n is always small; even 7 columns could be checked in less than 50 iterations. --jkl ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users -- Gunter Hick Software Engineer Scientific Games International GmbH Klitschgasse 2 – 4, A - 1130 Vienna, Austria FN 157284 a, HG Wien Tel: +43 1 80100 0 E-Mail: h...@scigames.at This e-mail is confidential and may well also be legally privileged. If you have received it in error, you are on notice as to its status and accordingly please notify us immediately by reply e-mail and then delete this message from your system. Please do not copy it or use it for any purposes, or disclose its contents to any person as to do so could be a breach of confidence. Thank you for your cooperation. ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Order of columns in group by statement affects query performance
On Fri, 26 Apr 2013 11:12:15 +0200 Hick Gunter h...@scigames.at wrote: It is not unreasonable to assume that in a well designed SQL Statement the GROUP BY clause will be backed up by the necessary index and an identical ORDER BY clause That is an entirely unreasonable assumption. Order may or not matter. I've often written GROUP BY queries ordered by the aggregate, or to select the maximum. Any index that covers all the GROUP BY fields is a good index because it allows aggregates to be computed on the fly as opposed to in a temporary table. Agreed. I hope it's clear now that covers is order-independent. It so happens, apparently, that the order in which the column names appear in the GROUP BY determine whether or not the index is used. Thats unfortunate, because it makes two equivalent queries perform very differently. --jkl ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Order of columns in group by statement affects query performance
AFAIK SQLite treats GROUP BY the same way as ORDER BY (taken from hints in the virtual table description). If you have an index that covers the GROUP BY clause in field order, then aggregate functions need store only the current value; if not, then you need an ephemeral table to hold the aggregate values. In more detail: IF you have an index that covers the GROUP BY clause in field order, then retrieving rows in index order guarantees that all rows belonging to the same group will be retrieved in one block AND makes the output rows come out sorted too. This allows SQLite to keep current aggregate values in registers. IF you do not have an index that covers the GROUP BY clause, then rows will be retrieved in some deterministic order other than by group. You need to retrieve and update the current aggregate values for the group each row is in. This forces SQLite to keep current aggregate values in an ephemeral table. IF you have an index that covers the GROUP BY clause in any other order, then you still have the guarantee that all rows belonging to the same group will be retrieved together, but the result rows will be ordered in index order and not GROUP BY order. This would probably require code paths presumably shared by GROUP BY and ORDER BY processing to be split. -Ursprüngliche Nachricht- Von: James K. Lowden [mailto:jklow...@schemamania.org] Gesendet: Donnerstag, 25. April 2013 01:55 An: sqlite-users@sqlite.org Betreff: Re: [sqlite] Order of columns in group by statement affects query performance On Wed, 24 Apr 2013 17:46:00 +0100 Simon Slavin slav...@bigfraud.org wrote: On 24 Apr 2013, at 5:14pm, Igor Tandetnik i...@tandetnik.org wrote: Note though that the query doesn't have an ORDER BY clause. It doesn't request rows in any particular order. SQLite could, in principle, reorder columns in GROUP BY to take advantage of the index. I suppose the optimizer just happens to miss this particular opportunity. But the GROUP BY clause has an order: Query 1: SELECT A,B,count(*) from tableTest group by A,B Query 2: SELECT A,B,count(*) from tableTest group by B,A The order in which the columns appear syntactically in the GROUP BY clause is meaningless in SQL. Igor is correct that the query processor could use any index beginning with B,A or A,B, should it so choose. --jkl ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users -- Gunter Hick Software Engineer Scientific Games International GmbH Klitschgasse 2 – 4, A - 1130 Vienna, Austria FN 157284 a, HG Wien Tel: +43 1 80100 0 E-Mail: h...@scigames.at This e-mail is confidential and may well also be legally privileged. If you have received it in error, you are on notice as to its status and accordingly please notify us immediately by reply e-mail and then delete this message from your system. Please do not copy it or use it for any purposes, or disclose its contents to any person as to do so could be a breach of confidence. Thank you for your cooperation. ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Order of columns in group by statement affects query performance
On Thu, 25 Apr 2013 10:29:34 +0200 Hick Gunter h...@scigames.at wrote: AFAIK SQLite treats GROUP BY the same way as ORDER BY (taken from hints in the virtual table description). That might be so, in some limited sense. It's obviously false in general because they mean different things and have different effects. If you have an index that covers the GROUP BY clause in field order, then aggregate functions need store only the current value; if not, then you need an ephemeral table to hold the aggregate values. Nonsense. The query parser sees GROUP BY A,B. The optimizer sees an index ordered B,A. By permuting the order of the columns in the GROUP BY clause, it finds a match for the index and uses it. Yes, the problem is O(n^2), where n is the number of columns in the GROUP BY, but n is always small; even 7 columns could be checked in less than 50 iterations. --jkl ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Order of columns in group by statement affects query performance
2013/4/25 James K. Lowden jklow...@schemamania.org Nonsense. The query parser sees GROUP BY A,B. The optimizer sees an index ordered B,A. By permuting the order of the columns in the GROUP BY clause, it finds a match for the index and uses it. Yes, the problem is O(n^2), where n is the number of columns in the GROUP BY, but n is always small; even 7 columns could be checked in less than 50 iterations. I believe its O(n!), but still doable for small n. I don't know the inner workings of the query optimizer but mabye instead of asking/check for a index of every permutation of the columns in the group by, it could just check if an index exists which covers all columns (even the sorting order doesn't matter). (the virtual table api needs an addition for that to work) ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Order of columns in group by statement affects query performance
On Thu, Apr 25, 2013 at 10:29:34AM +0200, Hick Gunter scratched on the wall: AFAIK SQLite treats GROUP BY the same way as ORDER BY (taken from hints in the virtual table description). They're not the same clause, they don't do the same thing. Now, it is true that most database systems implement the first step of a GROUP BY by sorting the query using semantics that are similar to ORDER BY. That way all of the rows in a related group are next to each other, and they're easier to process. I assume SQLite does the same thing. It is, however, as they say, an implementation detail. IF you have an index that covers the GROUP BY clause in any other order, then you still have the guarantee that all rows belonging to the same group will be retrieved together, but the result rows will be ordered in index order and not GROUP BY order. Except there is no such thing as GROUP BY order. SQL Golden Rule: If there is no ORDER BY, the rows have no order. According to SQL, neither the groups, nor the rows within a group (as they are fed into aggregates) have a defined order. Any query that makes assumptions about the ordered result of a GROUP BY is broken. Use the out-of-order index. -j -- Jay A. Kreibich J A Y @ K R E I B I.C H Intelligence is like underwear: it is important that you have it, but showing it to the wrong people has the tendency to make them feel uncomfortable. -- Angela Johnson ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Order of columns in group by statement affects query performance
On Thu, Apr 25, 2013 at 05:08:04PM +0200, Daniel Winter scratched on the wall: 2013/4/25 James K. Lowden jklow...@schemamania.org Nonsense. The query parser sees GROUP BY A,B. The optimizer sees an index ordered B,A. By permuting the order of the columns in the GROUP BY clause, it finds a match for the index and uses it. Yes, the problem is O(n^2), where n is the number of columns in the GROUP BY, but n is always small; even 7 columns could be checked in less than 50 iterations. I believe its O(n!), but still doable for small n. I don't know the inner workings of the query optimizer but mabye instead of asking/check for a index of every permutation of the columns in the group by, it could just check if an index exists which covers all columns (even the sorting order doesn't matter). (the virtual table api needs an addition for that to work) Permutations are O(N!), but that's not really what you want. Given a set of GROUP BY terms you want, generally, the index with the most terms in any initial order. You don't need a full match for the index to be a win. For example, GROUP BY A,B,C,D,E is likely to get a performance boost from an index on (A,D,B) and, *in general*, that should be a bigger win than an index on (B,C). Of course, since this is a query optimizer, there are always edge cases... For example, if there is an index over (E) that has 99% unique values, it is likely a better choice than (A,D,B)... it depends on the distribution of the index. Similarly, if any GROUP BY term maps to a unique index... boom, you're done. As with most things having to do with query optimization, the problem quickly explodes. On the other hand, SQLite must already have assumptions about index costs (with or without ANALYZE), so at least there's an existing set of weights and assumptions to work from. -j -- Jay A. Kreibich J A Y @ K R E I B I.C H Intelligence is like underwear: it is important that you have it, but showing it to the wrong people has the tendency to make them feel uncomfortable. -- Angela Johnson ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Order of columns in group by statement affects query performance
On 25 Apr 2013, at 4:23pm, Jay A. Kreibich j...@kreibi.ch wrote: Except there is no such thing as GROUP BY order. SQL Golden Rule: If there is no ORDER BY, the rows have no order. According to SQL, neither the groups, nor the rows within a group (as they are fed into aggregates) have a defined order. Any query that makes assumptions about the ordered result of a GROUP BY is broken. Use the out-of-order index. GROUP BY on multiple columns means that the values in all those columns have to be the same for the rows to be included in the same GROUP. It says nothing about the order those groups should appear in in the results of the SELECT. Okay. So adding this to what went by upthread, I was wrong. Column order in the GROUP BY clause doesn't matter. Therefore the upthread comment that GROUP BY A,B means exactly the same as GROUP BY B,A is correct. So if there's an index which features those columns in any order it can be used, not matter what order the columns appear in in the GROUP BY clause. So it was perfectly reasonable for the OP to wonder why an index was used for one order but not another. Simon. ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
[sqlite] Order of columns in group by statement affects query performance
Hello, I am using sqlite3 (3.7.15.2 at the moment) in a project. I discovered that the order of columns in a group by affects the performance of a query. Is this expected? For example: Table: Column A int, Column B int, Column C int One Index: A,B (combined) Query 1: SELECT A,B,count(*) from tableTest group by A,B Query 2: SELECT A,B,count(*) from tableTest group by B,A Query 1 will use the index, while query 2 will not. (which makes Query 1 a lot faster with bigger tables). Both querys will result with the same data. I do not really understand why it doesn't use the index for both querys. I am also using a virtual table (based on a two dimensional array) (very nice feature btw, sad that it is so unknown). I am seeing the same behaviour there. It will only use the index if it has the exact same order as the columns in the group by. Thanks, Daniel ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Order of columns in group by statement affects query performance
*Daniel Winter wrote:0* I discovered that the order of columns in a group by affects the performance of a query. Is this expected? Yes. For example: Table: Column A int, Column B int, Column C int One Index: A,B (combined) Query 1: SELECT A,B,count(*) from tableTest group by A,B Query 2: SELECT A,B,count(*) from tableTest group by B,A Query 1 will use the index, while query 2 will not. (which makes Query 1 a lot faster with bigger tables). Both querys will result with the same data. I do not really understand why it doesn't use the index for both querys. The index has row references pre-sorted according to the criteria specified for it. That is why it can be used to speed up a query which sorts by those criteria. There is no mystery as to why it does not help to speed up other queries which do not sort by those criteria. Looking at the query plan would show the index used for your first query and not for the second. A similar effect would be seen with joins which are typically processed as a sifted merge of sorted components. Best regards, -- Larry ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Order of columns in group by statement affects query performance
On 4/24/2013 11:49 AM, Larry Brasfield wrote: *Daniel Winter wrote:0* Table: Column A int, Column B int, Column C int One Index: A,B (combined) Query 1: SELECT A,B,count(*) from tableTest group by A,B Query 2: SELECT A,B,count(*) from tableTest group by B,A Query 1 will use the index, while query 2 will not. (which makes Query 1 a lot faster with bigger tables). Both querys will result with the same data. I do not really understand why it doesn't use the index for both querys. The index has row references pre-sorted according to the criteria specified for it. That is why it can be used to speed up a query which sorts by those criteria. Note though that the query doesn't have an ORDER BY clause. It doesn't request rows in any particular order. SQLite could, in principle, reorder columns in GROUP BY to take advantage of the index. I suppose the optimizer just happens to miss this particular opportunity. -- Igor Tandetnik ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Order of columns in group by statement affects query performance
On 24 Apr 2013, at 5:14pm, Igor Tandetnik i...@tandetnik.org wrote: Note though that the query doesn't have an ORDER BY clause. It doesn't request rows in any particular order. SQLite could, in principle, reorder columns in GROUP BY to take advantage of the index. I suppose the optimizer just happens to miss this particular opportunity. But the GROUP BY clause has an order: Query 1: SELECT A,B,count(*) from tableTest group by A,B Query 2: SELECT A,B,count(*) from tableTest group by B,A which is pretty much the same as having two different ORDER BY clauses. An index which is ideal for one of those is not ideal for the other. I don't find it weird that the optimiser would decide to use an index for one of them but not the other. To the original poster: do an ANALYZE, then see if you get different timings afterwards. Simon. ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Order of columns in group by statement affects query performance
On Wed, 24 Apr 2013 17:46:00 +0100 Simon Slavin slav...@bigfraud.org wrote: On 24 Apr 2013, at 5:14pm, Igor Tandetnik i...@tandetnik.org wrote: Note though that the query doesn't have an ORDER BY clause. It doesn't request rows in any particular order. SQLite could, in principle, reorder columns in GROUP BY to take advantage of the index. I suppose the optimizer just happens to miss this particular opportunity. But the GROUP BY clause has an order: Query 1: SELECT A,B,count(*) from tableTest group by A,B Query 2: SELECT A,B,count(*) from tableTest group by B,A The order in which the columns appear syntactically in the GROUP BY clause is meaningless in SQL. Igor is correct that the query processor could use any index beginning with B,A or A,B, should it so choose. --jkl ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users