Re: [sqlite] JOIN vs. INTERSECT speed and efficiency
>Other "am I right in thinking this" question: INTERSECT is only going >to be viable when comparing full records, correct? If you're looking >to filter table A by whether its primary key is also a primary key >for table B, but ignoring the other fields in both, then INTERSECT >becomes not an option, or at least starts making the query more >complex/ugly... correct? No. The data must merely be the same shape (order, number of columns): SELECT c1, c2, c3, c8 from t1 INTERSECT SELECT x5, g4, q7, b3 from t2; each of t1 and t2 may have fields named a1 a2 a4 ... a26 b1 b2 b3 ... b26 ... z1 z2 z3 ... z26 (for a total of 676 fields per table). The comparison is of resulting matrices, not underlying tables. Of course, if you do "SELECT * from table1;" the * is merely "syntactic sugar" (a short way of spelling) the complete, in declaration order, list of explicit columns. (So, if and only if the ROWID is an "explicit column" is it used in the intersection, otherwise it is not -- and concomitantly if you explicitly list the columns to intersect, then the rowid participates if and only if you have included it in the list of data to intersect.) INTERSECT / EXCEPT / UNION are matrix operations. The RHS (select statement before the operator) and LHS (select statement after the operator) provide the two matrices on which the operation is performed. Matrixes do not have column names, merely ordinal positions (column 1, column 2, column 3 and so forth). Similarly these operations do not care about column names, merely that the order (number of columns in each matrix) is the same. Comparisons are done by ordinal position of the item in the row. The output is a matrix. It only appears "row at a time" because of the primitive method of operation of (no matter how advanced) computers which can only perform operations as a serial sequence of steps. For "user convenience" the output column names are set to the RHS column names. --- The fact that there's a Highway to Hell but only a Stairway to Heaven says a lot about anticipated traffic volume. >-Original Message- >From: sqlite-users [mailto:sqlite-users- >boun...@mailinglists.sqlite.org] On Behalf Of David Raymond >Sent: Thursday, 7 September, 2017 10:31 >To: SQLite mailing list >Subject: Re: [sqlite] JOIN vs. INTERSECT speed and efficiency > >Although it may not translate as well to the more complex examples, >would you also consider adding the IN operator to your tests? I found >for example that "select v from t1 where v in t2;" did even better >than the join or the intersect. > > > > >-Original Message- >From: sqlite-users [mailto:sqlite-users- >boun...@mailinglists.sqlite.org] On Behalf Of R Smith >Sent: Thursday, September 07, 2017 8:06 AM >To: sqlite-users@mailinglists.sqlite.org >Subject: Re: [sqlite] JOIN vs. INTERSECT speed and efficiency > > >On 2017/09/07 3:43 AM, Keith Medcalf wrote: >> Try the same test using 147 columns in each table. > >Exactly the plan for this weekend :) > >> 1 column is rather trivial. Even a kindergarten kid could do it in >no time using crayons and the wall. > >So? That is non-sequitur, I am sure given enough crayons, wall-space >and >time, a kindergarten kid can do it with 147 columns too. That says >exactly nothing about the possible efficiencies of different methods. >If >however the 1-columness of the test gets somehow advantaged by being >the >PK (as Nico pointed out) or real world data such as TEXT entries sort >slower than INTs, then it might affect it, so the 147 column tests >will >tell. > > >> In other words except in very trivial cases (like having only one >column that is not nullable) it will be very difficult to write a >"correct" JOIN or correlated subquery that emulates an INTERSECT. > >Well I agree, but it is those trivial cases that are of interest >here, >and if there is a general JOIN optimization to be had. The INTERSECT >test merely served as the catalyst to put us on the trail of the >possible JOIN optimization, if there is even an optimization to be >had >(it might yet be a wild goose chase, which you seem to have your >money >on, so watch this space, I'll graciously accept your "told ya!" later >after testing). > > >Cheers, >Ryan > >___ >sqlite-users mailing list >sqlite-users@mailinglists.sqlite.org >http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users >___ >sqlite-users mailing list >sqlite-users@mailinglists.sqlite.org >http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] JOIN vs. INTERSECT speed and efficiency
Expanding things for when you get bored, in addition to JOIN vs INTERSECT vs IN I'd also be interested in JOIN vs EXCEPT vs NOT IN, as I tend to do more exclusion rather than intersection. The straight up "IN tablename" may be SQLite only, but it also supports IN (subquery) so "select v from t1 where v in (select v from t2)" also works, although the "IN tablename" is so much cleaner to read in my opinion. create table statesToSkip ( state text primary key collate nocase ) without rowid; select foo from bar where state not in statesToSkip. vs select foo from bar where state not in (select state from statesToSkip) vs select bar.foo from bar left outer join statesToSkip on (bar.state = statesToSkip.state) where statesToSkip.state is null; -Original Message- From: sqlite-users [mailto:sqlite-users-boun...@mailinglists.sqlite.org] On Behalf Of R Smith Sent: Thursday, September 07, 2017 3:51 PM To: sqlite-users@mailinglists.sqlite.org Subject: Re: [sqlite] JOIN vs. INTERSECT speed and efficiency On 2017/09/07 6:31 PM, David Raymond wrote: > Although it may not translate as well to the more complex examples, would you > also consider adding the IN operator to your tests? I found for example that > "select v from t1 where v in t2;" did even better than the join or the > intersect. Will do. The only thing I have somewhat against that specific query form is that it doesn't work in other engines (seems to not be standard). But for optimization in SQLite specifically that's perfect. > Other "am I right in thinking this" question: INTERSECT is only going to be > viable when comparing full records, correct? If you're looking to filter > table A by whether its primary key is also a primary key for table B, but > ignoring the other fields in both, then INTERSECT becomes not an option, or > at least starts making the query more complex/ugly... correct? INTERSECT will happily match however many columns you desire (and specify), there is no need to match full records or single keys specifically. ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] JOIN vs. INTERSECT speed and efficiency
On Thu, Sep 07, 2017 at 09:51:07PM +0200, R Smith wrote: > INTERSECT will happily match however many columns you desire (and specify), > there is no need to match full records or single keys specifically. But the two queries on either side of the set operator must have the same number of columns (and in strongly-typed RDBMSes, the same types). That's a huge constraint. ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] JOIN vs. INTERSECT speed and efficiency
On 2017/09/07 6:31 PM, David Raymond wrote: Although it may not translate as well to the more complex examples, would you also consider adding the IN operator to your tests? I found for example that "select v from t1 where v in t2;" did even better than the join or the intersect. Will do. The only thing I have somewhat against that specific query form is that it doesn't work in other engines (seems to not be standard). But for optimization in SQLite specifically that's perfect. Other "am I right in thinking this" question: INTERSECT is only going to be viable when comparing full records, correct? If you're looking to filter table A by whether its primary key is also a primary key for table B, but ignoring the other fields in both, then INTERSECT becomes not an option, or at least starts making the query more complex/ugly... correct? INTERSECT will happily match however many columns you desire (and specify), there is no need to match full records or single keys specifically. ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] JOIN vs. INTERSECT speed and efficiency
Although it may not translate as well to the more complex examples, would you also consider adding the IN operator to your tests? I found for example that "select v from t1 where v in t2;" did even better than the join or the intersect. Other "am I right in thinking this" question: INTERSECT is only going to be viable when comparing full records, correct? If you're looking to filter table A by whether its primary key is also a primary key for table B, but ignoring the other fields in both, then INTERSECT becomes not an option, or at least starts making the query more complex/ugly... correct? -Original Message- From: sqlite-users [mailto:sqlite-users-boun...@mailinglists.sqlite.org] On Behalf Of R Smith Sent: Thursday, September 07, 2017 8:06 AM To: sqlite-users@mailinglists.sqlite.org Subject: Re: [sqlite] JOIN vs. INTERSECT speed and efficiency On 2017/09/07 3:43 AM, Keith Medcalf wrote: > Try the same test using 147 columns in each table. Exactly the plan for this weekend :) > 1 column is rather trivial. Even a kindergarten kid could do it in no time > using crayons and the wall. So? That is non-sequitur, I am sure given enough crayons, wall-space and time, a kindergarten kid can do it with 147 columns too. That says exactly nothing about the possible efficiencies of different methods. If however the 1-columness of the test gets somehow advantaged by being the PK (as Nico pointed out) or real world data such as TEXT entries sort slower than INTs, then it might affect it, so the 147 column tests will tell. > In other words except in very trivial cases (like having only one column that > is not nullable) it will be very difficult to write a "correct" JOIN or > correlated subquery that emulates an INTERSECT. Well I agree, but it is those trivial cases that are of interest here, and if there is a general JOIN optimization to be had. The INTERSECT test merely served as the catalyst to put us on the trail of the possible JOIN optimization, if there is even an optimization to be had (it might yet be a wild goose chase, which you seem to have your money on, so watch this space, I'll graciously accept your "told ya!" later after testing). Cheers, Ryan ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] JOIN vs. INTERSECT speed and efficiency
On Wed, Sep 06, 2017 at 07:43:07PM -0600, Keith Medcalf wrote: > Try the same test using 147 columns in each table. > > 1 column is rather trivial. Even a kindergarten kid could do it in no > time using crayons and the wall. > > [...] > > In other words except in very trivial cases (like having only one > column that is not nullable) it will be very difficult to write a > "correct" JOIN or correlated subquery that emulates an INTERSECT. Yup. But that doesn't mean that the engine couldn't internally build a result-set from the query without some filtering JOIN, then implement the same strategy as an INTERSECT. You can't do this in SQL if the filter table has different shape than the result set, but the engine might be able to do it. On the other hand, building a complete result set first is... not online behavior. If the result set size is enormous, then the INTERSECT approach is going to make the user very unhappy! I do think OP's tests point out a case where SQLite3 is pessimally picking table scan over covering index scan... ...though scanning the index means that there will be no rowid column in the result, which might actually be a compatibility issue when using rowid tables, so maybe SQLite3 is doing exactly the right thing? I don't think that pessimization is too consequential as users can improve the situation by adding ORDER BY clauses or using WITHOUT ROWID. Nico -- ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] JOIN vs. INTERSECT speed and efficiency
On 2017/09/07 3:43 AM, Keith Medcalf wrote: Try the same test using 147 columns in each table. Exactly the plan for this weekend :) 1 column is rather trivial. Even a kindergarten kid could do it in no time using crayons and the wall. So? That is non-sequitur, I am sure given enough crayons, wall-space and time, a kindergarten kid can do it with 147 columns too. That says exactly nothing about the possible efficiencies of different methods. If however the 1-columness of the test gets somehow advantaged by being the PK (as Nico pointed out) or real world data such as TEXT entries sort slower than INTs, then it might affect it, so the 147 column tests will tell. In other words except in very trivial cases (like having only one column that is not nullable) it will be very difficult to write a "correct" JOIN or correlated subquery that emulates an INTERSECT. Well I agree, but it is those trivial cases that are of interest here, and if there is a general JOIN optimization to be had. The INTERSECT test merely served as the catalyst to put us on the trail of the possible JOIN optimization, if there is even an optimization to be had (it might yet be a wild goose chase, which you seem to have your money on, so watch this space, I'll graciously accept your "told ya!" later after testing). Cheers, Ryan ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] JOIN vs. INTERSECT speed and efficiency
Try the same test using 147 columns in each table. 1 column is rather trivial. Even a kindergarten kid could do it in no time using crayons and the wall. And of course the output of INTERSECT is ordered. It uses a sorter to perform the intersection. And of course the output is distinct, it uses a sorter to perform the intersection. In other words, select ... a bunch of columns ... from table1 intersect select ... an eual number of bunch of columns ... from table2 is equivalent to select ... the bunch of columns ... from table1 where exists (select * from table2 where (for each column position in table 2 equals that column position from table1, plus of course all the added stuff needed to handle nullable columns)) group by ... the bunch of columns ...; In other words except in very trivial cases (like having only one column that is not nullable) it will be very difficult to write a "correct" JOIN or correlated subquery that emulates an INTERSECT. --- The fact that there's a Highway to Hell but only a Stairway to Heaven says a lot about anticipated traffic volume. >-Original Message- >From: sqlite-users [mailto:sqlite-users- >boun...@mailinglists.sqlite.org] On Behalf Of R Smith >Sent: Wednesday, 6 September, 2017 14:58 >To: sqlite-users@mailinglists.sqlite.org >Subject: Re: [sqlite] JOIN vs. INTERSECT speed and efficiency > >On 2017/09/06 8:26 PM, Nico Williams wrote: >> On Wed, Sep 06, 2017 at 08:05:57PM +0200, R Smith wrote: >>> -- Another interesting thing to note: The INTERSECT test produces >ORDERED >>> -- output, which suggests that an ORDER-BY addition to the query >would >>> -- favour the INTERSECT method. >> Nothing about INTERSECT requires it to produce ordered output. > >No, and it was not suggested, it was just noted that it does, >suggesting >that it could be the more performant choice when adding an ORDER BY >clause, which turned out to not only be true in terms of being the >better choice, but also that it itself sped up by simply adding the >ORDER BY clause as was demonstrated in Test 6. > >> Nothing about the JOIN case makes it not possible to produce >ordered >> output by accident. > >Yet it doesn't seem to by accident, which would suggest that an ORDER >BY >clause when added to the JOIN statements would incur an additional >time >penalty for having to actually order the results - Yet, as again >demonstrated in Test 6, the ORDER BY actually sped up the JOIN query >too >(perhaps via forcing the Index earlier or used in a different way) - >which was most interesting, and, as you noted, there is nothing about >the JOIN that precludes it from having ordered output, so this >optimization might be worthwhile. > >> You'll want to re-measure with an ORDER BY added. > >I did. It was done in Test 6. It showed significantly interesting >results. Was my explanation lacking in clarity or did it fall down >the >TLDR; rabbit hole? :) > > >> In any case, this is quite interesting. Many uses of JOIN are not >> merely to filter results, but to construct joined result rows -- >such >> uses of JOIN cannot be optimized by using INTERSECT. But for >> filter-uses of JOIN... this might be a useful optimization for the >> engine to learn. > >I agree, and not only the INTERSECT optimization but the tests >suggest >adding a silent ORDER BY would also be an optimization, though not >sure >if the effort-to-pleasure ratio is low enough yet. Perhaps if re- >doing >the tests with tables using several more non-Integer columns to see >if >the optimization could be generalized across all kinds of data in >some >way. I might pursue this later when I have some time. > > >Cheers, >Ryan >___ >sqlite-users mailing list >sqlite-users@mailinglists.sqlite.org >http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] JOIN vs. INTERSECT speed and efficiency
On Wed, Sep 06, 2017 at 11:54:35PM +0200, R Smith wrote: > It's still remarkable that in both tests 5 and 6 I've used the very same PK > setup, yet Test 6 was significantly faster with the added ORDER BY clause. > In tests 1 through 4 I did not use a PK at all, just plain INT data field, > but then I did not test the ORDER BY in those tests. It's an optimizer issue. It didn't occur to the optimizer that scanning a covering index was better than scanning the table because the covering index has (in this case, and always, for rowid tables anyways) strictly less contents to read and decode. Scanning the covering index has the happy side-effect (if you wanted if) of producing ordered results and making an equivalent ORDER BY free. (Whereas scanning the table will produce results in rowid order, which is almost certainly not useful unless you explicitly wanted an INTEGER PRIMARY KEY.) Note that INTERSECT could have used a hash table, thus producing unordered results (most likely). But SQLite3 only knows b-trees. All of this explains the accidental ordering / non-ordering. And also why you shouldn't count on it: it's all implementaton details! But just because you know to add an ORDER BY doesn't mean you shouldn't think to make it match some suitable index... The optimizer is nice, but you still have to think a little bit like an optimizer yourself :( > It might turn out to be a wild goose chase, but that will be easily evident > when testing without the PK and with more realistic real-world data. I'll do > that this weekend. You can't have a PK-less table -- SQLite3 always want some PK, even if it's a hidden rowid column. WITHOUT ROWID tables make this clearer: sqlite> create table t(a text) without rowid; Error: PRIMARY KEY missing on table t Adding an explicit PK implicitly added the covering index that sped up the JOIN (once you forced the optimizer to use it). But you should just always have had an explicit PK and WITHOUT ROWID. You still found something interesting about using JOINs to filter result sets (as opposed to adding columns to the the result set). Something to keep in mind... I do a lot of queries where some of the JOINed tables are used only for filtering. It's not always possible to convert such queries to INTERSECT, but it might be possible for SQLite3 to learn how to perform the equivalent optimization internally, and when to do it. Nico -- ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] JOIN vs. INTERSECT speed and efficiency
On 2017/09/06 11:17 PM, Nico Williams wrote: If you'll redo this I'd urge you to use WITHOUT ROWIDS. First, that's almost always the right thing to do anyways. Second, it won't perform worse but likely will perform better. Third, write performance definitely should improve with WITHOUT ROWIDS. Fourth, I think users are starting to use WITHOUT ROWIDS more, so testing that seems more useful. Nico All good points which I will definitely heed. I will of course test both types of tables but ensure that when using row_id tables I do not inadvertently have possible PK short-circuiting of query plans. It's still remarkable that in both tests 5 and 6 I've used the very same PK setup, yet Test 6 was significantly faster with the added ORDER BY clause. In tests 1 through 4 I did not use a PK at all, just plain INT data field, but then I did not test the ORDER BY in those tests. It might turn out to be a wild goose chase, but that will be easily evident when testing without the PK and with more realistic real-world data. I'll do that this weekend. Cheers, Ryan ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] JOIN vs. INTERSECT speed and efficiency
On Wed, Sep 06, 2017 at 10:57:41PM +0200, R Smith wrote: > On 2017/09/06 8:26 PM, Nico Williams wrote: > >On Wed, Sep 06, 2017 at 08:05:57PM +0200, R Smith wrote: > >>-- Another interesting thing to note: The INTERSECT test produces ORDERED > >>-- output, which suggests that an ORDER-BY addition to the query would > >>-- favour the INTERSECT method. > >Nothing about INTERSECT requires it to produce ordered output. > > No, and it was not suggested, it was just noted that it does, suggesting > that it could be the more performant choice when adding an ORDER BY clause, > which turned out to not only be true in terms of being the better choice, > but also that it itself sped up by simply adding the ORDER BY clause as was > demonstrated in Test 6. I point this out only because users should know not to assume result set order without an ORDER BY. > >Nothing about the JOIN case makes it not possible to produce ordered > >output by accident. > > Yet it doesn't seem to by accident, which would suggest that an ORDER BY > clause when added to the JOIN statements would incur an additional time See my explanation below. > penalty for having to actually order the results - Yet, as again > demonstrated in Test 6, the ORDER BY actually sped up the JOIN query too > (perhaps via forcing the Index earlier or used in a different way) - which > was most interesting, and, as you noted, there is nothing about the JOIN > that precludes it from having ordered output, so this optimization might be > worthwhile. > > >You'll want to re-measure with an ORDER BY added. > > I did. It was done in Test 6. It showed significantly interesting results. > Was my explanation lacking in clarity or did it fall down the TLDR; rabbit > hole? :) I saw it. I think the ORDER BY helped the JOIN because it caused SQLite3 to scan a covering index (the primary) key instead of scanning the table. That without it SQLite3 didn't use that index is rather inefficient, though it may not be a win in real-world use-cases to fix that. Of course, IF you had used WITHOUT ROWIDs you would have found (I'm sure) that the JOIN also produced ordered results by default and was as fast as in your 6th test. In fact, INTERSECT does an implicit ordering step by building a b-tree that the JOIN with the index scan optimization does not have to build at all, so JOIN has a leg up on INTERSECT in that sense. > I agree, and not only the INTERSECT optimization but the tests suggest > adding a silent ORDER BY would also be an optimization, though not sure if I think here the ORDER BY merely forced SQLite3 to pick the more efficient query plan, and that it's probably a (rather minor) optimizer bug that it didn't do so to begin with without the ORDER BY. > the effort-to-pleasure ratio is low enough yet. Perhaps if re-doing the > tests with tables using several more non-Integer columns to see if the > optimization could be generalized across all kinds of data in some way. I > might pursue this later when I have some time. If you'll redo this I'd urge you to use WITHOUT ROWIDS. First, that's almost always the right thing to do anyways. Second, it won't perform worse but likely will perform better. Third, write performance definitely should improve with WITHOUT ROWIDS. Fourth, I think users are starting to use WITHOUT ROWIDS more, so testing that seems more useful. Nico -- ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] JOIN vs. INTERSECT speed and efficiency
On 2017/09/06 8:26 PM, Nico Williams wrote: On Wed, Sep 06, 2017 at 08:05:57PM +0200, R Smith wrote: -- Another interesting thing to note: The INTERSECT test produces ORDERED -- output, which suggests that an ORDER-BY addition to the query would -- favour the INTERSECT method. Nothing about INTERSECT requires it to produce ordered output. No, and it was not suggested, it was just noted that it does, suggesting that it could be the more performant choice when adding an ORDER BY clause, which turned out to not only be true in terms of being the better choice, but also that it itself sped up by simply adding the ORDER BY clause as was demonstrated in Test 6. Nothing about the JOIN case makes it not possible to produce ordered output by accident. Yet it doesn't seem to by accident, which would suggest that an ORDER BY clause when added to the JOIN statements would incur an additional time penalty for having to actually order the results - Yet, as again demonstrated in Test 6, the ORDER BY actually sped up the JOIN query too (perhaps via forcing the Index earlier or used in a different way) - which was most interesting, and, as you noted, there is nothing about the JOIN that precludes it from having ordered output, so this optimization might be worthwhile. You'll want to re-measure with an ORDER BY added. I did. It was done in Test 6. It showed significantly interesting results. Was my explanation lacking in clarity or did it fall down the TLDR; rabbit hole? :) In any case, this is quite interesting. Many uses of JOIN are not merely to filter results, but to construct joined result rows -- such uses of JOIN cannot be optimized by using INTERSECT. But for filter-uses of JOIN... this might be a useful optimization for the engine to learn. I agree, and not only the INTERSECT optimization but the tests suggest adding a silent ORDER BY would also be an optimization, though not sure if the effort-to-pleasure ratio is low enough yet. Perhaps if re-doing the tests with tables using several more non-Integer columns to see if the optimization could be generalized across all kinds of data in some way. I might pursue this later when I have some time. Cheers, Ryan ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] JOIN vs. INTERSECT speed and efficiency
On Wed, Sep 06, 2017 at 08:05:57PM +0200, R Smith wrote: > -- Another interesting thing to note: The INTERSECT test produces ORDERED > -- output, which suggests that an ORDER-BY addition to the query would > -- favour the INTERSECT method. Nothing about INTERSECT requires it to produce ordered output. Nothing about the JOIN case makes it not possible to produce ordered output by accident. You'll want to re-measure with an ORDER BY added. In any case, this is quite interesting. Many uses of JOIN are not merely to filter results, but to construct joined result rows -- such uses of JOIN cannot be optimized by using INTERSECT. But for filter-uses of JOIN... this might be a useful optimization for the engine to learn. Nico -- ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
[sqlite] JOIN vs. INTERSECT speed and efficiency
Hi all, For those interested, after a recent thread from a poster called Joe asking about the most efficient way to find values that coincide from two separate tables, a response from Clemens Ladisch and a further elaboration from myself suggested the following: SELECT v FROM t1 INTERSECT SELECT v FROM t2 I was going to mention that that INTERSECT is simply equivalent to a JOIN on v (if there are no duplicate entries) of the form: SELECT v FROM t1 JOIN t2 ON t2.v = t1.v But then the question arose: Which is more efficient? - So at the time of posting I refrained from mentioning it, but it kept in my mind, until I decided to try and answer the question. Next step was to set up an experiment to mimic the above and test it. *Experiment* - If the reader is interested, or would like to compare and contrast with possibly different DB engines, versions, platforms or settings, read on, else just skip ahead to the Conclusion bit. Setup: - Using SQLite. - Ran on my slowest (to magnify efficiency difference) notebook (Win10 64bit) - using sqlitespeed (http://www.sqlc.rifin.co.za/) but you can use any system since the objective measurement here is qualitative difference ratio, not quantitative direct units. - with the standard pre-compiled 32-bit SQLite DLL available from the SQLite download page (https://sqlite.org/download.html) - test DLL version: 3.17.0 Tests: I will post the first test script in full with added annotation to explain the rationale and show the full working, but the subsequent scripts I will only post the money-shots and avoid the repetitive parts. Note that the test script is run in a transaction which is not shown explicitly and at the end rolled back to avoid differences in tests on account of the file content changing during the tests. Test 1: -- SQLite version 3.17.0 [ Release: 2017-02-13 ] on SQLitespeed version 2.0.2.4. -- -- This first table holds Parameters for testing. -- idxCount specifies how many entries to make in the test tables. (i.e. Number of added rows). -- rndDiv is the modality chosen for the random values to control the magnitude of random values -- so that the ratio of expected matching values to value-count can be controlled. CREATE TABLE p ( idxCount INT, rndDiv INT ); -- Here we add the Test parameter values. This is the only section that changes between -- the first 4 tests. -- This test has a row-count of 10 (One-Hundred-Thousand) and a modality of 1000 times the -- idxCount so that in all the rows, we expect roughly only about 0.1% of matches. -- Note that if every value in table 1 has a 1-in-1000 chance of having a matching valuein -- table 2, then for 100,000 rows there should be ~100 matches which is ~0.1% of the row-count. -- This test was picked as the full example since it produces the least amount of rows. INSERT INTO p(idxCount, rndDiv) VALUES (10, 1000 * 10); -- Confirm the parameters. SELECT * FROM p; -- idxCount | rndDiv -- | -- 10 | 1 -- Following are the two test tables t1 and t2 each containing one column v to hold the random values. CREATE TABLE t1 ( v INT ); CREATE TABLE t2 ( v INT ); -- Here we fill the two test tables according to our parameters above. WITH R(idx,Val) AS ( SELECT 0, 1000 UNION ALL SELECT idx+1, abs(random() % p.rndDiv) FROM R,p WHERE idx < p.idxCount ) INSERT INTO t1(v) SELECT Val FROM R ; WITH R(idx,Val) AS ( SELECT 0, 1000 UNION ALL SELECT idx+1, abs(random() % p.rndDiv) FROM R,p WHERE idx < p.idxCount ) INSERT INTO t2(v) SELECT Val FROM R ; -- Test 1 - using JOIN. -- Note that this is the second iteration of tests. In the first iteration I had the INTERSECT -- query before the JOIN query, but I swapped it around for the second iteration because -- the JOIN was generally faster for low match counts so I wanted to ensure neither test -- gained an advantage from being second due to some or other caching by the Query planner. SELECT t1.v FROM t1 JOIN t2 ON t2.v = t1.v; -- v -- -- 1000 -- 2312703 -- 87720925 -- 29736409 -- 62527166 -- 25171143 -- 24168552 -- 86449735 -- 83544235 -- 45671286 -- 69343788 -- 42394827 -- 92142603 -- 87106564 -- 4593574 -- 6914348 -- 16358938 -- 12568863 -- 20105830 -- 91354724 -- 87992157 -- 17605134 -- 28584588 -- 78633251 -- 98955905 -- 19979768 -- 20956231 -- 30819730 -- 93942875 -- 45346494 -- 96346064 -- 32224203 -- 89622511 -- 39267531 -- 3116133 -- 31172079 -- 87828771 -- 82931503 -- 89108957 -- 80067973 -- 89366000 -- 68319117 -- 37802556 -- 64391927 -- 84515054 -- 11071461 -- 40682706 -- 78441313 -- 17977211 -- 659811 -- 14504321 -- 57479870 -- 44134958 -- 94642155 -- 37520503 --