Hi all,
For those interested, after a recent thread from a poster called Joe
asking about the most efficient way to find values that coincide from
two separate tables, a response from Clemens Ladisch and a further
elaboration from myself suggested the following:
SELECT v FROM t1 INTERSECT SELECT v FROM t2
I was going to mention that that INTERSECT is simply equivalent to a
JOIN on v (if there are no duplicate entries) of the form:
SELECT v FROM t1 JOIN t2 ON t2.v = t1.v
But then the question arose: Which is more efficient? - So at the time
of posting I refrained from mentioning it, but it kept in my mind, until
I decided to try and answer the question.
Next step was to set up an experiment to mimic the above and test it.
*Experiment* - If the reader is interested, or would like to compare and
contrast with possibly different DB engines, versions, platforms or
settings, read on, else just skip ahead to the Conclusion bit.
Setup:
- Using SQLite.
- Ran on my slowest (to magnify efficiency difference) notebook (Win10
64bit)
- using sqlitespeed (http://www.sqlc.rifin.co.za/) but you can use any
system since the objective measurement here is qualitative difference
ratio, not quantitative direct units.
- with the standard pre-compiled 32-bit SQLite DLL available from the
SQLite download page (https://sqlite.org/download.html)
- test DLL version: 3.17.0
Tests:
I will post the first test script in full with added annotation to
explain the rationale and show the full working, but the subsequent
scripts I will only post the money-shots and avoid the repetitive parts.
Note that the test script is run in a transaction which is not shown
explicitly and at the end rolled back to avoid differences in tests on
account of the file content changing during the tests.
Test 1:
-- SQLite version 3.17.0 [ Release: 2017-02-13 ] on SQLitespeed
version 2.0.2.4.
--
================================================================================================
-- This first table holds Parameters for testing.
-- idxCount specifies how many entries to make in the test tables. (i.e.
Number of added rows).
-- rndDiv is the modality chosen for the random values to control the
magnitude of random values
-- so that the ratio of expected matching values to value-count can be
controlled.
CREATE TABLE p (
idxCount INT,
rndDiv INT
);
-- Here we add the Test parameter values. This is the only section that
changes between
-- the first 4 tests.
-- This test has a row-count of 100000 (One-Hundred-Thousand) and a
modality of 1000 times the
-- idxCount so that in all the rows, we expect roughly only about 0.1%
of matches.
-- Note that if every value in table 1 has a 1-in-1000 chance of having
a matching valuein
-- table 2, then for 100,000 rows there should be ~100 matches which is
~0.1% of the row-count.
-- This test was picked as the full example since it produces the least
amount of rows.
INSERT INTO p(idxCount, rndDiv) VALUES (100000, 1000 * 100000);
-- Confirm the parameters.
SELECT * FROM p;
-- idxCount | rndDiv
-- ------------ | ------------
-- 100000 | 100000000
-- Following are the two test tables t1 and t2 each containing one
column v to hold the random values.
CREATE TABLE t1 (
v INT
);
CREATE TABLE t2 (
v INT
);
-- Here we fill the two test tables according to our parameters above.
WITH R(idx,Val) AS (
SELECT 0, 1000
UNION ALL
SELECT idx+1, abs(random() % p.rndDiv) FROM R,p WHERE idx < p.idxCount
)
INSERT INTO t1(v) SELECT Val FROM R
;
WITH R(idx,Val) AS (
SELECT 0, 1000
UNION ALL
SELECT idx+1, abs(random() % p.rndDiv) FROM R,p WHERE idx < p.idxCount
)
INSERT INTO t2(v) SELECT Val FROM R
;
-- Test 1 - using JOIN.
-- Note that this is the second iteration of tests. In the first
iteration I had the INTERSECT
-- query before the JOIN query, but I swapped it around for the second
iteration because
-- the JOIN was generally faster for low match counts so I wanted to
ensure neither test
-- gained an advantage from being second due to some or other caching by
the Query planner.
SELECT t1.v FROM t1 JOIN t2 ON t2.v = t1.v;
-- v
-- --------
-- 1000
-- 2312703
-- 87720925
-- 29736409
-- 62527166
-- 25171143
-- 24168552
-- 86449735
-- 83544235
-- 45671286
-- 69343788
-- 42394827
-- 92142603
-- 87106564
-- 4593574
-- 6914348
-- 16358938
-- 12568863
-- 20105830
-- 91354724
-- 87992157
-- 17605134
-- 28584588
-- 78633251
-- 98955905
-- 19979768
-- 20956231
-- 30819730
-- 93942875
-- 45346494
-- 96346064
-- 32224203
-- 89622511
-- 39267531
-- 3116133
-- 31172079
-- 87828771
-- 82931503
-- 89108957
-- 80067973
-- 89366000
-- 68319117
-- 37802556
-- 64391927
-- 84515054
-- 11071461
-- 40682706
-- 78441313
-- 17977211
-- 659811
-- 14504321
-- 57479870
-- 44134958
-- 94642155
-- 37520503
-- 48456547
-- 99084161
-- 84938198
-- 14002200
-- 78221942
-- 45576636
-- 42790976
-- 6533108
-- 92535066
-- 89299140
-- 22666602
-- 63098262
-- 69021687
-- 88420428
-- 20451211
-- 30003757
-- 90711051
-- 43222100
-- 30233728
-- 61634416
-- 86830002
-- 40248177
-- 1237709
-- 58707473
-- 39900130
-- 92531343
-- 78010626
-- 67793752
-- 29390573
-- 23079807
-- 80883211
-- 15167730
-- 15532946
-- 84444860
-- 89113099
-- 70441636
-- 47253234
-- 49325743
-- 55033258
-- 90556390
-- 86704459
-- 45170002
-- 47126034
-- 9496434
-- 42163489
-- 92938539
-- 55105343
-- 31110805
-- 80062347
-- 65114976
-- 8309007
-- 78366148
-- 44587619
-- 27312379
-- 56915827
-- 57706349
-- 84004545
-- 52791919
-- 8606523
-- 27763198
-- 21761548
-- 69642690
-- Item Stats: Item No: 8 Query Size (Chars): 45
-- Result Columns: 1 Result Rows: 117
-- VM Work Steps: 1200488 Rows Modified: 0
-- Full Query Time: 0d 00h 00m and 00.177s
-- Query Result: Success.
--
------------------------------------------------------------------------------------------------
-- The INTERSECT test:
SELECT v FROM t1 INTERSECT SELECT v FROM t2;
-- v
-- ------------
-- 1000
-- 659811
-- 1237709
-- 2312703
-- 3116133
-- 4593574
-- 6533108
-- 6914348
-- 8309007
-- 8606523
-- 9496434
-- 11071461
-- 12568863
-- 14002200
-- 14504321
-- 15167730
-- 15532946
-- 16358938
-- 17605134
-- 17977211
-- 19979768
-- 20105830
-- 20451211
-- 20956231
-- 21761548
-- 22666602
-- 23079807
-- 24168552
-- 25171143
-- 27312379
-- 27763198
-- 28584588
-- 29390573
-- 29736409
-- 30003757
-- 30233728
-- 30819730
-- 31110805
-- 31172079
-- 32224203
-- 37520503
-- 37802556
-- 39267531
-- 39900130
-- 40248177
-- 40682706
-- 42163489
-- 42394827
-- 42790976
-- 43222100
-- 44134958
-- 44587619
-- 45170002
-- 45346494
-- 45576636
-- 45671286
-- 47126034
-- 47253234
-- 48456547
-- 49325743
-- 52791919
-- 55033258
-- 55105343
-- 56915827
-- 57479870
-- 57706349
-- 58707473
-- 61634416
-- 62527166
-- 63098262
-- 64391927
-- 65114976
-- 67793752
-- 68319117
-- 69021687
-- 69343788
-- 69642690
-- 70441636
-- 78010626
-- 78221942
-- 78366148
-- 78441313
-- 78633251
-- 80062347
-- 80067973
-- 80883211
-- 82931503
-- 83544235
-- 84004545
-- 84444860
-- 84515054
-- 84938198
-- 86449735
-- 86704459
-- 86830002
-- 87106564
-- 87720925
-- 87828771
-- 87992157
-- 88420428
-- 89108957
-- 89113099
-- 89299140
-- 89366000
-- 89622511
-- 90556390
-- 90711051
-- 91354724
-- 92142603
-- 92531343
-- 92535066
-- 92938539
-- 93942875
-- 94642155
-- 96346064
-- 98955905
-- 99084161
-- Item Stats: Item No: 9 Query Size (Chars): 46
-- Result Columns: 1 Result Rows: 117
-- VM Work Steps: 1100093 Rows Modified: 0
-- Full Query Time: 0d 00h 00m and 00.228s
-- Query Result: Success.
--
------------------------------------------------------------------------------------------------
-- As you can see, the JOIN is roughly 30% faster than INTERSECT (0.177s
vs. 0.228s) for this test.
-- Also note that the Virtual-Machine work steps needed to accomplish
the result differed
-- between the two tests (1.2 million vs. 1.1 million) which is very
interesting because the
-- JOIN used ~9% more VM steps to complete (i.e. less efficient), but
did so in a faster time.
-- Another interesting thing to note: The INTERSECT test produces
ORDERED output, which
-- suggests that an ORDER-BY addition to the query would favour the
INTERSECT method.
-- (This led to Test 6).
-- This is the query plan for the INTERSECT query:
EXPLAIN QUERY PLAN SELECT v FROM t1 INTERSECT SELECT v FROM t2;
-- selectid | order | from | detail
-- -------- | ----- | ---- |
---------------------------------------------------------
-- 1 | 0 | 0 | SCAN TABLE t1
-- 2 | 0 | 0 | SCAN TABLE t2
-- 0 | 0 | 0 | COMPOUND SUBQUERIES 1 AND 2 USING TEMP
B-TREE (INTERSECT)
-- And the plan for the JOIN query:
EXPLAIN QUERY PLAN SELECT t1.v FROM t1 JOIN t2 ON t2.v = t1.v;
-- selectid | order | from | detail
-- -------- | ----- | ---- |
----------------------------------------------------
-- 0 | 0 | 0 | SCAN TABLE t1
-- 0 | 1 | 1 | SEARCH TABLE t2 USING AUTOMATIC COVERING
INDEX (v=?)
-- Cleanup.
DROP TABLE t1;
DROP TABLE t2;
-- Script Stats: Total Script Execution Time: 0d 00h 00m and
00.718s
-- Total Script Query Time: 0d 00h 00m and
00.683s
-- Total Database Rows Changed: 200003
-- Total Virtual-Machine Steps: 8501075
-- Last executed Item Index: 13
-- Last Script Error:
--
------------------------------------------------------------------------------------------------
-- Note form the log below that both queries created an automatic index
(unsurprisingly as suggested by the Query Plans above).
-- 2017-09-06 16:10:46.021 | [Success] Script Success.
-- 2017-09-06 16:10:46.066 | [Success] Transaction Rolled back.
-- ------- DB-Engine Logs (Contains logged information from all DB
connections during run) ------
-- [2017-09-06 16:10:45.286] APPLICATION : Script
E:\Documents\SQLiteAutoScript.sql started at 16:10:45.286 on 06 September.
-- [2017-09-06 16:10:45.604] ERROR (284) : automatic index on t2(v)
-- [2017-09-06 16:10:46.014] ERROR (284) : automatic index on t2(v)
--
================================================================================================
Test 2: Using idxCount of 100,000 rows and ~1% possible matches (rndDiv
now 100 x idxCount)
INSERT INTO p(idxCount, rndDiv) VALUES (100000, 100*100000);
SELECT t1.v FROM t1 JOIN t2 ON t2.v = t1.v;
-- v
-- -------
-- 1000
-- 2251245
-- 6690293
-- 8754086
-- 6871656
...
...
-- 3230498
-- 6398647
-- 5083104
-- Item Stats: Item No: 8 Query Size (Chars): 45
-- Result Columns: 1 Result Rows: 997
-- VM Work Steps: 1204006 Rows Modified: 0
-- Full Query Time: 0d 00h 00m and 00.187s
-- Query Result: Success.
--
------------------------------------------------------------------------------------------------
SELECT v FROM t1 INTERSECT SELECT v FROM t2;
-- v
-- ------------
-- 1000
-- 2058
-- 12265
...
...
-- 9980854
-- 9985094
-- Item Stats: Item No: 9 Query Size (Chars): 46
-- Result Columns: 1 Result Rows: 986
-- VM Work Steps: 1100517 Rows Modified: 0
-- Full Query Time: 0d 00h 00m and 00.353s
-- Query Result: Success.
--
------------------------------------------------------------------------------------------------
Test 2 Notes: VM steps almost identical to test 1. JOIN is now near
twice as fast.
Note also that JOIN produced 11 rows more than INTERSECT which suggests
we had 11 duplicate values in the tables.
Test 3: Using idxCount of 100,000 rows and ~10% possible matches (rndDiv
now 10 x idxCount)
INSERT INTO p(idxCount,rndDiv) VALUES (100000, 10*100000);
SELECT t1.v FROM t1 JOIN t2 ON t2.v = t1.v;
-- v
-- ------
-- 1000
-- 894336
-- 296578
...
...
-- 795418
-- 66041
-- 800682
-- Item Stats: Item No: 8 Query Size (Chars): 45
-- Result Columns: 1 Result Rows: 9993
-- VM Work Steps: 1239992 Rows Modified: 0
-- Full Query Time: 0d 00h 00m and 00.267s
-- Query Result: Success.
--
------------------------------------------------------------------------------------------------
SELECT v FROM t1 INTERSECT SELECT v FROM t2;
-- v
-- ------------
-- 165
-- 301
-- 353
...
...
-- 999724
-- 999864
-- 999999
-- Item Stats: Item No: 9 Query Size (Chars): 46
-- Result Columns: 1 Result Rows: 9042
-- VM Work Steps: 1103714 Rows Modified: 0
-- Full Query Time: 0d 00h 00m and 00.285s
-- Query Result: Success.
--
------------------------------------------------------------------------------------------------
Test 3 Notes: Again similar VM steps though it seems the join gained
most added VM steps. JOIN and INTERSECT now performing very near equal.
Test 4: Using idxCount of 100,000 rows and ~100% possible matches
(rndDiv now 1 x idxCount)
INSERT INTO p(idxCount,rndDiv) VALUES (100000, 1*100000);
SELECT t1.v FROM t1 JOIN t2 ON t2.v = t1.v;
-- v
-- ------
-- 1000
-- 297764
-- 386572
...
...
-- 18950
-- 24377
-- Item Stats: Item No: 8 Query Size (Chars): 45
-- Result Columns: 1 Result Rows: 99479
-- VM Work Steps: 1597937 Rows Modified: 0
-- Full Query Time: 0d 00h 00m and 00.812s
-- Query Result: Success.
--
------------------------------------------------------------------------------------------------
SELECT v FROM t1 INTERSECT SELECT v FROM t2;
-- v
-- -----
-- 0
-- 2
-- 5
-- 7
-- 8
-- 14
...
...
-- 99992
-- 99996
-- 99998
-- Item Stats: Item No: 9 Query Size (Chars): 46
-- Result Columns: 1 Result Rows: 40027
-- VM Work Steps: 1069969 Rows Modified: 0
-- Full Query Time: 0d 00h 00m and 00.567s
-- Query Result: Success.
--
------------------------------------------------------------------------------------------------
Test 4 Notes: JOIN is using now more VM steps and INTERSECT is using
even less, and INTERSECT now clearly the faster of the two, not to
mention it still has ORDERED output while JOIN has not.
However...
In these higher match-count tests, the possibility of duplicate entries
in both tables t1 and t2 also increased, to the point where there is a
large percentage of duplicates in test 4 which would have produced more
rows and slower execution from the JOIN query than the INTERSECT query,
especially in cases where the same value was duplicate in both tables.
This almost disqualifies Test 4 as an objective measure.
To better test this in the next tests, I changed the test script to only
allow rows in table t1 and t2 that have no duplicates (made v a PRIMARY
KEY and used INSERT OR IGNORE when populating the tables):
CREATE TABLE t1 (
v INT PRIMARY KEY
);
CREATE TABLE t2 (
v INT PRIMARY KEY
);
WITH R(idx,Val) AS (
SELECT 0, 1000
UNION ALL
SELECT idx+1, abs(random() % p.rndDiv) FROM R,p WHERE idx < p.idxCount
)
INSERT OR IGNORE INTO t1(v) SELECT Val FROM R
;
WITH R(idx,Val) AS (
SELECT 0, 1000
UNION ALL
SELECT idx+1, abs(random() % p.rndDiv) FROM R,p WHERE idx < p.idxCount
)
INSERT OR IGNORE INTO t2(v) SELECT Val FROM R
;
Test 5: Using idxCount of 100,000 rows and ~100% possible matches
(rndDiv now 1 x idxCount) - Same as test 4 but without duplicates.
SELECT t1.v FROM t1 JOIN t2 ON t2.v = t1.v;
-- Item Stats: Item No: 8 Query Size
(Chars): 45
-- Result Columns: 1 Result Rows: 40099
-- VM Work Steps: 476303 Rows Modified: 0
-- Full Query Time: 0d 00h 00m and 00.411s
-- Query Result: Success.
--
------------------------------------------------------------------------------------------------
SELECT t1.v FROM t1 INTERSECT SELECT t2.v FROM t2;
-- Item Stats: Item No: 9 Query Size
(Chars): 52
-- Result Columns: 1 Result Rows: 40099
-- VM Work Steps: 775787 Rows Modified: 0
-- Full Query Time: 0d 00h 00m and 00.553s
-- Query Result: Success.
--
------------------------------------------------------------------------------------------------
Test 5 Notes: The no-duplicates test script favoured JOIN all the way to
100% matches (for both speed and efficiency, but with un-ordered results
for JOIN).
Test 6: Using idxCount of 100,000 rows and ~100% possible matches
(rndDiv now 1 x idxCount) - Same as Test 5 but now explicitly ordered.
SELECT t1.v FROM t1 JOIN t2 ON t2.v = t1.v ORDER BY t1.v;
-- Item Stats: Item No: 8 Query Size
(Chars): 59
-- Result Columns: 1 Result Rows: 39906
-- VM Work Steps: 475396 Rows Modified: 0
-- Full Query Time: 0d 00h 00m and 00.281s
-- Query Result: Success.
--
------------------------------------------------------------------------------------------------
SELECT t1.v FROM t1 INTERSECT SELECT t2.v FROM t2 ORDER BY t1.v;
-- Item Stats: Item No: 9 Query Size
(Chars): 66
-- Result Columns: 1 Result Rows: 39906
-- VM Work Steps: 1330447 Rows Modified: 0
-- Full Query Time: 0d 00h 00m and 00.263s
-- Query Result: Success.
--
------------------------------------------------------------------------------------------------
Test 6 Notes: Changing again to test for ordered results, INTERSECT was
indeed faster again, even for queries with no duplicates - but a much
bigger surprise here is that BOTH queries are significantly faster than
their un-ordered versions even though much less efficient in the
INTERSECT case (almost twice the VM steps but executing in half the time)!
*Deductions*
The following could be inferred from the tests:
- I did not show the initial set-up tests, but it determined that the
table-size directly correlates with the measured differences in
efficiency - i.e. it is not a factor, so I picked a table-size (100K
rows) that was a good work-out for the Query engine but that ran
relatively quick in human terms.
- JOIN wins convincingly for a low number of matches, but as the number
of matching values increase, INTERSECT gets more efficient and
eventually wins in situations where duplicate values occur.
- JOIN's prowess is not linear, it does better (in ratio to INTERSECT)
at 1% matches than at 0.1% matches, and remain superior as long as no
duplicate values occur.
- The output from INTERSECT is ordered.
- Output for both JOIN and INTERSECT is faster if you use explicit
ordering on the Joined/Intesected column(s) - This surprised me the most.
*Conclusion*: To be most efficient: If you expect hitting more than 10%
matches out of the total rows in the tables and they contain possible
duplicate values or you need the output ordered - use INTERSECT. For all
other cases, use JOIN, and either way, use ORDER BY on the
Joined/Intersected column to make things faster still.
Cheers,
Ryan
_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users