Re: [sqlite] JOIN vs. INTERSECT speed and efficiency

2017-09-07 Thread Keith Medcalf

>Other "am I right in thinking this" question: INTERSECT is only going
>to be viable when comparing full records, correct? If you're looking
>to filter table A by whether its primary key is also a primary key
>for table B, but ignoring the other fields in both, then INTERSECT
>becomes not an option, or at least starts making the query more
>complex/ugly... correct?

No.  The data must merely be the same shape (order, number of columns):

SELECT c1, c2, c3, c8 from t1
INTERSECT
SELECT x5, g4, q7, b3 from t2;

each of t1 and t2 may have fields named a1 a2 a4 ... a26 b1 b2 b3 ... b26 ... 
z1 z2 z3 ... z26 (for a total of 676 fields per table).

The comparison is of resulting matrices, not underlying tables.  Of course, if 
you do "SELECT * from table1;" the * is merely "syntactic sugar" (a short way 
of spelling) the complete, in declaration order, list of explicit columns.  
(So, if and only if the ROWID is an "explicit column" is it used in the 
intersection, otherwise it is not -- and concomitantly if you explicitly list 
the columns to intersect, then the rowid participates if and only if you have 
included it in the list of data to intersect.)

INTERSECT / EXCEPT / UNION are matrix operations.  The RHS (select statement 
before the operator) and LHS (select statement after the operator) provide the 
two matrices on which the operation is performed.  Matrixes do not have column 
names, merely ordinal positions (column 1, column 2, column 3 and so forth).  
Similarly these operations do not care about column names, merely that the 
order (number of columns in each matrix) is the same.  Comparisons are done by 
ordinal position of the item in the row.  The output is a matrix.  It only 
appears "row at a time" because of the primitive method of operation of (no 
matter how advanced) computers which can only perform operations as a serial 
sequence of steps.  For "user convenience" the output column names are set to 
the RHS column names.

---
The fact that there's a Highway to Hell but only a Stairway to Heaven says a 
lot about anticipated traffic volume.

>-Original Message-
>From: sqlite-users [mailto:sqlite-users-
>boun...@mailinglists.sqlite.org] On Behalf Of David Raymond
>Sent: Thursday, 7 September, 2017 10:31
>To: SQLite mailing list
>Subject: Re: [sqlite] JOIN vs. INTERSECT speed and efficiency
>
>Although it may not translate as well to the more complex examples,
>would you also consider adding the IN operator to your tests? I found
>for example that "select v from t1 where v in t2;" did even better
>than the join or the intersect.
>
>
>
>
>-Original Message-
>From: sqlite-users [mailto:sqlite-users-
>boun...@mailinglists.sqlite.org] On Behalf Of R Smith
>Sent: Thursday, September 07, 2017 8:06 AM
>To: sqlite-users@mailinglists.sqlite.org
>Subject: Re: [sqlite] JOIN vs. INTERSECT speed and efficiency
>
>
>On 2017/09/07 3:43 AM, Keith Medcalf wrote:
>> Try the same test using 147 columns in each table.
>
>Exactly the plan for this weekend :)
>
>> 1 column is rather trivial.  Even a kindergarten kid could do it in
>no time using crayons and the wall.
>
>So? That is non-sequitur, I am sure given enough crayons, wall-space
>and
>time, a kindergarten kid can do it with 147 columns too. That says
>exactly nothing about the possible efficiencies of different methods.
>If
>however the 1-columness of the test gets somehow advantaged by being
>the
>PK (as Nico pointed out) or real world data such as TEXT entries sort
>slower than INTs, then it might affect it, so the 147 column tests
>will
>tell.
>
>
>> In other words except in very trivial cases (like having only one
>column that is not nullable) it will be very difficult to write a
>"correct" JOIN or correlated subquery that emulates an INTERSECT.
>
>Well I agree, but it is those trivial cases that are of interest
>here,
>and if there is a general JOIN optimization to be had. The INTERSECT
>test merely served as the catalyst to put us on the trail of the
>possible JOIN optimization, if there is even an optimization to be
>had
>(it might yet be a wild goose chase, which you seem to have your
>money
>on, so watch this space, I'll graciously accept your "told ya!" later
>after testing).
>
>
>Cheers,
>Ryan
>
>___
>sqlite-users mailing list
>sqlite-users@mailinglists.sqlite.org
>http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>___
>sqlite-users mailing list
>sqlite-users@mailinglists.sqlite.org
>http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users



___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] JOIN vs. INTERSECT speed and efficiency

2017-09-07 Thread David Raymond
Expanding things for when you get bored, in addition to  JOIN vs 
INTERSECT vs IN I'd also be interested in  JOIN vs EXCEPT vs NOT IN, as 
I tend to do more exclusion rather than intersection.

The straight up "IN tablename" may be SQLite only, but it also supports IN 
(subquery) so "select v from t1 where v in (select v from t2)" also works, 
although the "IN tablename" is so much cleaner to read in my opinion.

create table statesToSkip (
  state text primary key collate nocase
) without rowid;

select foo from bar where state not in statesToSkip.

vs

select foo from bar where state not in (select state from statesToSkip)

vs

select bar.foo from bar left outer join statesToSkip
on (bar.state = statesToSkip.state)
where statesToSkip.state is null;


-Original Message-
From: sqlite-users [mailto:sqlite-users-boun...@mailinglists.sqlite.org] On 
Behalf Of R Smith
Sent: Thursday, September 07, 2017 3:51 PM
To: sqlite-users@mailinglists.sqlite.org
Subject: Re: [sqlite] JOIN vs. INTERSECT speed and efficiency

On 2017/09/07 6:31 PM, David Raymond wrote:
> Although it may not translate as well to the more complex examples, would you 
> also consider adding the IN operator to your tests? I found for example that 
> "select v from t1 where v in t2;" did even better than the join or the 
> intersect.

Will do. The only thing I have somewhat against that specific query form 
is that it doesn't work in other engines (seems to not be standard). But 
for optimization in SQLite specifically that's perfect.

> Other "am I right in thinking this" question: INTERSECT is only going to be 
> viable when comparing full records, correct? If you're looking to filter 
> table A by whether its primary key is also a primary key for table B, but 
> ignoring the other fields in both, then INTERSECT becomes not an option, or 
> at least starts making the query more complex/ugly... correct?

INTERSECT will happily match however many columns you desire (and 
specify), there is no need to match full records or single keys 
specifically.


___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] JOIN vs. INTERSECT speed and efficiency

2017-09-07 Thread Nico Williams
On Thu, Sep 07, 2017 at 09:51:07PM +0200, R Smith wrote:
> INTERSECT will happily match however many columns you desire (and specify),
> there is no need to match full records or single keys specifically.

But the two queries on either side of the set operator must have the
same number of columns (and in strongly-typed RDBMSes, the same types).

That's a huge constraint.
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] JOIN vs. INTERSECT speed and efficiency

2017-09-07 Thread R Smith

On 2017/09/07 6:31 PM, David Raymond wrote:

Although it may not translate as well to the more complex examples, would you also 
consider adding the IN operator to your tests? I found for example that "select v 
from t1 where v in t2;" did even better than the join or the intersect.


Will do. The only thing I have somewhat against that specific query form 
is that it doesn't work in other engines (seems to not be standard). But 
for optimization in SQLite specifically that's perfect.



Other "am I right in thinking this" question: INTERSECT is only going to be 
viable when comparing full records, correct? If you're looking to filter table A by 
whether its primary key is also a primary key for table B, but ignoring the other fields 
in both, then INTERSECT becomes not an option, or at least starts making the query more 
complex/ugly... correct?


INTERSECT will happily match however many columns you desire (and 
specify), there is no need to match full records or single keys 
specifically.



___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] JOIN vs. INTERSECT speed and efficiency

2017-09-07 Thread David Raymond
Although it may not translate as well to the more complex examples, would you 
also consider adding the IN operator to your tests? I found for example that 
"select v from t1 where v in t2;" did even better than the join or the 
intersect.

Other "am I right in thinking this" question: INTERSECT is only going to be 
viable when comparing full records, correct? If you're looking to filter table 
A by whether its primary key is also a primary key for table B, but ignoring 
the other fields in both, then INTERSECT becomes not an option, or at least 
starts making the query more complex/ugly... correct?



-Original Message-
From: sqlite-users [mailto:sqlite-users-boun...@mailinglists.sqlite.org] On 
Behalf Of R Smith
Sent: Thursday, September 07, 2017 8:06 AM
To: sqlite-users@mailinglists.sqlite.org
Subject: Re: [sqlite] JOIN vs. INTERSECT speed and efficiency


On 2017/09/07 3:43 AM, Keith Medcalf wrote:
> Try the same test using 147 columns in each table.

Exactly the plan for this weekend :)

> 1 column is rather trivial.  Even a kindergarten kid could do it in no time 
> using crayons and the wall.

So? That is non-sequitur, I am sure given enough crayons, wall-space and 
time, a kindergarten kid can do it with 147 columns too. That says 
exactly nothing about the possible efficiencies of different methods. If 
however the 1-columness of the test gets somehow advantaged by being the 
PK (as Nico pointed out) or real world data such as TEXT entries sort 
slower than INTs, then it might affect it, so the 147 column tests will 
tell.


> In other words except in very trivial cases (like having only one column that 
> is not nullable) it will be very difficult to write a "correct" JOIN or 
> correlated subquery that emulates an INTERSECT.

Well I agree, but it is those trivial cases that are of interest here, 
and if there is a general JOIN optimization to be had. The INTERSECT 
test merely served as the catalyst to put us on the trail of the 
possible JOIN optimization, if there is even an optimization to be had 
(it might yet be a wild goose chase, which you seem to have your money 
on, so watch this space, I'll graciously accept your "told ya!" later 
after testing).


Cheers,
Ryan

___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] JOIN vs. INTERSECT speed and efficiency

2017-09-07 Thread Nico Williams
On Wed, Sep 06, 2017 at 07:43:07PM -0600, Keith Medcalf wrote:
> Try the same test using 147 columns in each table.
> 
> 1 column is rather trivial.  Even a kindergarten kid could do it in no
> time using crayons and the wall.
> 
> [...]
> 
> In other words except in very trivial cases (like having only one
> column that is not nullable) it will be very difficult to write a
> "correct" JOIN or correlated subquery that emulates an INTERSECT.

Yup.  But that doesn't mean that the engine couldn't internally build a
result-set from the query without some filtering JOIN, then implement
the same strategy as an INTERSECT.  You can't do this in SQL if the
filter table has different shape than the result set, but the engine
might be able to do it.

On the other hand, building a complete result set first is... not online
behavior.  If the result set size is enormous, then the INTERSECT
approach is going to make the user very unhappy!

I do think OP's tests point out a case where SQLite3 is pessimally
picking table scan over covering index scan...

...though scanning the index
means that there will be no rowid column in the result, which might
actually be a compatibility issue when using rowid tables, so maybe
SQLite3 is doing exactly the right thing?

I don't think that pessimization is too consequential as users can
improve the situation by adding ORDER BY clauses or using WITHOUT ROWID.

Nico
-- 
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] JOIN vs. INTERSECT speed and efficiency

2017-09-07 Thread R Smith


On 2017/09/07 3:43 AM, Keith Medcalf wrote:

Try the same test using 147 columns in each table.


Exactly the plan for this weekend :)


1 column is rather trivial.  Even a kindergarten kid could do it in no time 
using crayons and the wall.


So? That is non-sequitur, I am sure given enough crayons, wall-space and 
time, a kindergarten kid can do it with 147 columns too. That says 
exactly nothing about the possible efficiencies of different methods. If 
however the 1-columness of the test gets somehow advantaged by being the 
PK (as Nico pointed out) or real world data such as TEXT entries sort 
slower than INTs, then it might affect it, so the 147 column tests will 
tell.




In other words except in very trivial cases (like having only one column that is not 
nullable) it will be very difficult to write a "correct" JOIN or correlated 
subquery that emulates an INTERSECT.


Well I agree, but it is those trivial cases that are of interest here, 
and if there is a general JOIN optimization to be had. The INTERSECT 
test merely served as the catalyst to put us on the trail of the 
possible JOIN optimization, if there is even an optimization to be had 
(it might yet be a wild goose chase, which you seem to have your money 
on, so watch this space, I'll graciously accept your "told ya!" later 
after testing).



Cheers,
Ryan

___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] JOIN vs. INTERSECT speed and efficiency

2017-09-06 Thread Keith Medcalf

Try the same test using 147 columns in each table.

1 column is rather trivial.  Even a kindergarten kid could do it in no time 
using crayons and the wall.

And of course the output of INTERSECT is ordered.  It uses a sorter to perform 
the intersection.  And of course the output is distinct, it uses a sorter to 
perform the intersection.

In other words,

select ... a bunch of columns ...
from table1
intersect
select ... an eual number of bunch of columns ...
from table2

is equivalent to

select ... the bunch of columns ...
  from table1
 where exists (select * from table2
where (for each column position in table 2 equals that column 
position from table1, plus of course all the added stuff needed to handle 
nullable columns))
group by ... the bunch of columns ...;

In other words except in very trivial cases (like having only one column that 
is not nullable) it will be very difficult to write a "correct" JOIN or 
correlated subquery that emulates an INTERSECT.

---
The fact that there's a Highway to Hell but only a Stairway to Heaven says a 
lot about anticipated traffic volume.


>-Original Message-
>From: sqlite-users [mailto:sqlite-users-
>boun...@mailinglists.sqlite.org] On Behalf Of R Smith
>Sent: Wednesday, 6 September, 2017 14:58
>To: sqlite-users@mailinglists.sqlite.org
>Subject: Re: [sqlite] JOIN vs. INTERSECT speed and efficiency
>
>On 2017/09/06 8:26 PM, Nico Williams wrote:
>> On Wed, Sep 06, 2017 at 08:05:57PM +0200, R Smith wrote:
>>> -- Another interesting thing to note: The INTERSECT test produces
>ORDERED
>>> -- output, which suggests that an ORDER-BY addition to the query
>would
>>> -- favour the INTERSECT method.
>> Nothing about INTERSECT requires it to produce ordered output.
>
>No, and it was not suggested, it was just noted that it does,
>suggesting
>that it could be the more performant choice when adding an ORDER BY
>clause, which turned out to not only be true in terms of being the
>better choice, but also that it itself sped up by simply adding the
>ORDER BY clause as was demonstrated in Test 6.
>
>> Nothing about the JOIN case makes it not possible to produce
>ordered
>> output by accident.
>
>Yet it doesn't seem to by accident, which would suggest that an ORDER
>BY
>clause when added to the JOIN statements would incur an additional
>time
>penalty for having to actually order the results - Yet, as again
>demonstrated in Test 6, the ORDER BY actually sped up the JOIN query
>too
>(perhaps via forcing the Index earlier or used in a different way) -
>which was most interesting, and, as you noted, there is nothing about
>the JOIN that precludes it from having ordered output, so this
>optimization might be worthwhile.
>
>> You'll want to re-measure with an ORDER BY added.
>
>I did. It was done in Test 6. It showed significantly interesting
>results. Was my explanation lacking in clarity or did it fall down
>the
>TLDR; rabbit hole? :)
>
>
>> In any case, this is quite interesting.  Many uses of JOIN are not
>> merely to filter results, but to construct joined result rows --
>such
>> uses of JOIN cannot be optimized by using INTERSECT.  But for
>> filter-uses of JOIN... this might be a useful optimization for the
>> engine to learn.
>
>I agree, and not only the INTERSECT optimization but the tests
>suggest
>adding a silent ORDER BY would also be an optimization, though not
>sure
>if the effort-to-pleasure ratio is low enough yet. Perhaps if re-
>doing
>the tests with tables using several more non-Integer columns to see
>if
>the optimization could be generalized across all kinds of data in
>some
>way. I might pursue this later when I have some time.
>
>
>Cheers,
>Ryan
>___
>sqlite-users mailing list
>sqlite-users@mailinglists.sqlite.org
>http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users



___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] JOIN vs. INTERSECT speed and efficiency

2017-09-06 Thread Nico Williams
On Wed, Sep 06, 2017 at 11:54:35PM +0200, R Smith wrote:
> It's still remarkable that in both tests 5 and 6 I've used the very same PK
> setup, yet Test 6 was significantly faster with the added ORDER BY clause.
> In tests 1 through 4 I did not use a PK at all, just plain INT data field,
> but then I did not test the ORDER BY in those tests.

It's an optimizer issue.  It didn't occur to the optimizer that scanning
a covering index was better than scanning the table because the covering
index has (in this case, and always, for rowid tables anyways) strictly
less contents to read and decode.

Scanning the covering index has the happy side-effect (if you wanted if) of
producing ordered results and making an equivalent ORDER BY free.

(Whereas scanning the table will produce results in rowid order, which is
almost certainly not useful unless you explicitly wanted an INTEGER
PRIMARY KEY.)

Note that INTERSECT could have used a hash table, thus producing
unordered results (most likely).  But SQLite3 only knows b-trees.

All of this explains the accidental ordering / non-ordering.  And also
why you shouldn't count on it: it's all implementaton details!

But just because you know to add an ORDER BY doesn't mean you shouldn't
think to make it match some suitable index...  The optimizer is nice,
but you still have to think a little bit like an optimizer yourself :(

> It might turn out to be a wild goose chase, but that will be easily evident
> when testing without the PK and with more realistic real-world data. I'll do
> that this weekend.

You can't have a PK-less table -- SQLite3 always want some PK, even if
it's a hidden rowid column.  WITHOUT ROWID tables make this clearer:

  sqlite> create table t(a text) without rowid;
  Error: PRIMARY KEY missing on table t

Adding an explicit PK implicitly added the covering index that sped up
the JOIN (once you forced the optimizer to use it).  But you should just
always have had an explicit PK and WITHOUT ROWID.

You still found something interesting about using JOINs to filter result
sets (as opposed to adding columns to the the result set).  Something to
keep in mind...  I do a lot of queries where some of the JOINed tables
are used only for filtering.  It's not always possible to convert such
queries to INTERSECT, but it might be possible for SQLite3 to learn how
to perform the equivalent optimization internally, and when to do it.

Nico
-- 
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] JOIN vs. INTERSECT speed and efficiency

2017-09-06 Thread R Smith


On 2017/09/06 11:17 PM, Nico Williams wrote:


If you'll redo this I'd urge you to use WITHOUT ROWIDS.  First, that's
almost always the right thing to do anyways.  Second, it won't perform
worse but likely will perform better.  Third, write performance
definitely should improve with WITHOUT ROWIDS.  Fourth, I think users
are starting to use WITHOUT ROWIDS more, so testing that seems more
useful.

Nico


All good points which I will definitely heed. I will of course test both 
types of tables but ensure that when using row_id tables I do not 
inadvertently have possible PK short-circuiting of query plans.


It's still remarkable that in both tests 5 and 6 I've used the very same 
PK setup, yet Test 6 was significantly faster with the added ORDER BY 
clause. In tests 1 through 4 I did not use a PK at all, just plain INT 
data field, but then I did not test the ORDER BY in those tests.


It might turn out to be a wild goose chase, but that will be easily 
evident when testing without the PK and with more realistic real-world 
data. I'll do that this weekend.


Cheers,
Ryan

___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] JOIN vs. INTERSECT speed and efficiency

2017-09-06 Thread Nico Williams
On Wed, Sep 06, 2017 at 10:57:41PM +0200, R Smith wrote:
> On 2017/09/06 8:26 PM, Nico Williams wrote:
> >On Wed, Sep 06, 2017 at 08:05:57PM +0200, R Smith wrote:
> >>-- Another interesting thing to note: The INTERSECT test produces ORDERED
> >>-- output, which suggests that an ORDER-BY addition to the query would
> >>-- favour the INTERSECT method.
> >Nothing about INTERSECT requires it to produce ordered output.
> 
> No, and it was not suggested, it was just noted that it does, suggesting
> that it could be the more performant choice when adding an ORDER BY clause,
> which turned out to not only be true in terms of being the better choice,
> but also that it itself sped up by simply adding the ORDER BY clause as was
> demonstrated in Test 6.

I point this out only because users should know not to assume result set
order without an ORDER BY.

> >Nothing about the JOIN case makes it not possible to produce ordered
> >output by accident.
> 
> Yet it doesn't seem to by accident, which would suggest that an ORDER BY
> clause when added to the JOIN statements would incur an additional time

See my explanation below.

> penalty for having to actually order the results - Yet, as again
> demonstrated in Test 6, the ORDER BY actually sped up the JOIN query too
> (perhaps via forcing the Index earlier or used in a different way) - which
> was most interesting, and, as you noted, there is nothing about the JOIN
> that precludes it from having ordered output, so this optimization might be
> worthwhile.
> 
> >You'll want to re-measure with an ORDER BY added.
> 
> I did. It was done in Test 6. It showed significantly interesting results.
> Was my explanation lacking in clarity or did it fall down the TLDR; rabbit
> hole? :)

I saw it.

I think the ORDER BY helped the JOIN because it caused SQLite3 to scan a
covering index (the primary) key instead of scanning the table.  That
without it SQLite3 didn't use that index is rather inefficient, though
it may not be a win in real-world use-cases to fix that.

Of course, IF you had used WITHOUT ROWIDs you would have found (I'm
sure) that the JOIN also produced ordered results by default and was as
fast as in your 6th test.

In fact, INTERSECT does an implicit ordering step by building a b-tree
that the JOIN with the index scan optimization does not have to build at
all, so JOIN has a leg up on INTERSECT in that sense.

> I agree, and not only the INTERSECT optimization but the tests suggest
> adding a silent ORDER BY would also be an optimization, though not sure if

I think here the ORDER BY merely forced SQLite3 to pick the more
efficient query plan, and that it's probably a (rather minor) optimizer
bug that it didn't do so to begin with without the ORDER BY.

> the effort-to-pleasure ratio is low enough yet. Perhaps if re-doing the
> tests with tables using several more non-Integer columns to see if the
> optimization could be generalized across all kinds of data in some way. I
> might pursue this later when I have some time.

If you'll redo this I'd urge you to use WITHOUT ROWIDS.  First, that's
almost always the right thing to do anyways.  Second, it won't perform
worse but likely will perform better.  Third, write performance
definitely should improve with WITHOUT ROWIDS.  Fourth, I think users
are starting to use WITHOUT ROWIDS more, so testing that seems more
useful.

Nico
-- 
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] JOIN vs. INTERSECT speed and efficiency

2017-09-06 Thread R Smith

On 2017/09/06 8:26 PM, Nico Williams wrote:

On Wed, Sep 06, 2017 at 08:05:57PM +0200, R Smith wrote:

-- Another interesting thing to note: The INTERSECT test produces ORDERED
-- output, which suggests that an ORDER-BY addition to the query would
-- favour the INTERSECT method.

Nothing about INTERSECT requires it to produce ordered output.


No, and it was not suggested, it was just noted that it does, suggesting 
that it could be the more performant choice when adding an ORDER BY 
clause, which turned out to not only be true in terms of being the 
better choice, but also that it itself sped up by simply adding the 
ORDER BY clause as was demonstrated in Test 6.



Nothing about the JOIN case makes it not possible to produce ordered
output by accident.


Yet it doesn't seem to by accident, which would suggest that an ORDER BY 
clause when added to the JOIN statements would incur an additional time 
penalty for having to actually order the results - Yet, as again 
demonstrated in Test 6, the ORDER BY actually sped up the JOIN query too 
(perhaps via forcing the Index earlier or used in a different way) - 
which was most interesting, and, as you noted, there is nothing about 
the JOIN that precludes it from having ordered output, so this 
optimization might be worthwhile.



You'll want to re-measure with an ORDER BY added.


I did. It was done in Test 6. It showed significantly interesting 
results. Was my explanation lacking in clarity or did it fall down the 
TLDR; rabbit hole? :)




In any case, this is quite interesting.  Many uses of JOIN are not
merely to filter results, but to construct joined result rows -- such
uses of JOIN cannot be optimized by using INTERSECT.  But for
filter-uses of JOIN... this might be a useful optimization for the
engine to learn.


I agree, and not only the INTERSECT optimization but the tests suggest 
adding a silent ORDER BY would also be an optimization, though not sure 
if the effort-to-pleasure ratio is low enough yet. Perhaps if re-doing 
the tests with tables using several more non-Integer columns to see if 
the optimization could be generalized across all kinds of data in some 
way. I might pursue this later when I have some time.



Cheers,
Ryan
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] JOIN vs. INTERSECT speed and efficiency

2017-09-06 Thread Nico Williams
On Wed, Sep 06, 2017 at 08:05:57PM +0200, R Smith wrote:
> -- Another interesting thing to note: The INTERSECT test produces ORDERED
> -- output, which suggests that an ORDER-BY addition to the query would
> -- favour the INTERSECT method.

Nothing about INTERSECT requires it to produce ordered output.

Nothing about the JOIN case makes it not possible to produce ordered
output by accident.

You'll want to re-measure with an ORDER BY added.

In any case, this is quite interesting.  Many uses of JOIN are not
merely to filter results, but to construct joined result rows -- such
uses of JOIN cannot be optimized by using INTERSECT.  But for
filter-uses of JOIN... this might be a useful optimization for the
engine to learn.

Nico
-- 
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


[sqlite] JOIN vs. INTERSECT speed and efficiency

2017-09-06 Thread R Smith

Hi all,

For those interested, after a recent thread from a poster called Joe 
asking about the most efficient way to find values that coincide from 
two separate tables, a response from Clemens Ladisch and a further 
elaboration from myself suggested the following:


SELECT v FROM t1 INTERSECT SELECT v FROM t2

I was going to mention that that INTERSECT is simply equivalent to a 
JOIN on v (if there are no duplicate entries) of the form:


SELECT v FROM t1 JOIN t2 ON t2.v = t1.v

But then the question arose: Which is more efficient? - So at the time 
of posting I refrained from mentioning it, but it kept in my mind, until 
I decided to try and answer the question.


Next step was to set up an experiment to mimic the above and test it.


*Experiment* - If the reader is interested, or would like to compare and 
contrast with possibly different DB engines, versions, platforms or 
settings, read on, else just skip ahead to the Conclusion bit.


Setup:
- Using SQLite.
- Ran on my slowest (to magnify efficiency difference) notebook (Win10 
64bit)
- using sqlitespeed (http://www.sqlc.rifin.co.za/) but you can use any 
system since the objective measurement here is qualitative difference 
ratio, not quantitative direct units.
- with the standard pre-compiled 32-bit SQLite DLL available from the 
SQLite download page (https://sqlite.org/download.html)

- test DLL version: 3.17.0


Tests:

I will post the first test script in full with added annotation to 
explain the rationale and show the full working, but the subsequent 
scripts I will only post the money-shots and avoid the repetitive parts.
Note that the test script is run in a transaction which is not shown 
explicitly and at the end rolled back to avoid differences in tests on 
account of the file content changing during the tests.


Test 1:

  -- SQLite version 3.17.0  [ Release: 2017-02-13 ]  on SQLitespeed 
version 2.0.2.4.
  -- 



-- This first table holds Parameters for testing.
-- idxCount specifies how many entries to make in the test tables. (i.e. 
Number of added rows).
-- rndDiv is the modality chosen for the random values to control the 
magnitude of random values
-- so that the ratio of expected matching values to value-count can be 
controlled.


CREATE TABLE p (
  idxCount INT,
  rndDiv INT
);

-- Here we add the Test parameter values. This is the only section that 
changes between

-- the first 4 tests.
-- This test has a row-count of 10 (One-Hundred-Thousand) and a 
modality of 1000 times the
-- idxCount so that in all the rows, we expect roughly only about 0.1% 
of matches.
-- Note that if every value in table 1 has a 1-in-1000 chance of having 
a matching valuein
-- table 2, then for 100,000 rows there should be ~100 matches which is 
~0.1% of the row-count.


-- This test was picked as the full example since it produces the least 
amount of rows.


INSERT INTO p(idxCount, rndDiv) VALUES (10, 1000 * 10);

-- Confirm the parameters.
SELECT * FROM p;

  --   idxCount   |    rndDiv
  --  | 
  --    10    |   1


-- Following are the two test tables t1 and t2 each containing one 
column v to hold the random values.

CREATE TABLE t1 (
  v INT
);

CREATE TABLE t2 (
  v INT
);

-- Here we fill the two test tables according to our parameters above.
WITH R(idx,Val) AS (
    SELECT 0, 1000
    UNION ALL
    SELECT idx+1, abs(random() % p.rndDiv) FROM R,p WHERE idx < p.idxCount
)
INSERT INTO t1(v) SELECT Val FROM R
;

WITH R(idx,Val) AS (
    SELECT 0, 1000
    UNION ALL
    SELECT idx+1, abs(random() % p.rndDiv) FROM R,p WHERE idx < p.idxCount
)
INSERT INTO t2(v) SELECT Val FROM R
;

-- Test 1 - using JOIN.
-- Note that this is the second iteration of tests. In the first 
iteration I had the INTERSECT
-- query before the JOIN query, but I swapped it around for the second 
iteration because
-- the JOIN was generally faster for low match counts so I wanted to 
ensure neither test
-- gained an advantage from being second due to some or other caching by 
the Query planner.


SELECT t1.v FROM t1 JOIN t2 ON t2.v = t1.v;

  -- v
  -- 
  -- 1000
  --  2312703
  -- 87720925
  -- 29736409
  -- 62527166
  -- 25171143
  -- 24168552
  -- 86449735
  -- 83544235
  -- 45671286
  -- 69343788
  -- 42394827
  -- 92142603
  -- 87106564
  --  4593574
  --  6914348
  -- 16358938
  -- 12568863
  -- 20105830
  -- 91354724
  -- 87992157
  -- 17605134
  -- 28584588
  -- 78633251
  -- 98955905
  -- 19979768
  -- 20956231
  -- 30819730
  -- 93942875
  -- 45346494
  -- 96346064
  -- 32224203
  -- 89622511
  -- 39267531
  --  3116133
  -- 31172079
  -- 87828771
  -- 82931503
  -- 89108957
  -- 80067973
  -- 89366000
  -- 68319117
  -- 37802556
  -- 64391927
  -- 84515054
  -- 11071461
  -- 40682706
  -- 78441313
  -- 17977211
  --   659811
  -- 14504321
  -- 57479870
  -- 44134958
  -- 94642155
  -- 37520503
  --