Hi all,

For those interested, after a recent thread from a poster called Joe asking about the most efficient way to find values that coincide from two separate tables, a response from Clemens Ladisch and a further elaboration from myself suggested the following:

SELECT v FROM t1 INTERSECT SELECT v FROM t2

I was going to mention that that INTERSECT is simply equivalent to a JOIN on v (if there are no duplicate entries) of the form:

SELECT v FROM t1 JOIN t2 ON t2.v = t1.v

But then the question arose: Which is more efficient? - So at the time of posting I refrained from mentioning it, but it kept in my mind, until I decided to try and answer the question.

Next step was to set up an experiment to mimic the above and test it.


*Experiment* - If the reader is interested, or would like to compare and contrast with possibly different DB engines, versions, platforms or settings, read on, else just skip ahead to the Conclusion bit.

Setup:
- Using SQLite.
- Ran on my slowest (to magnify efficiency difference) notebook (Win10 64bit) - using sqlitespeed (http://www.sqlc.rifin.co.za/) but you can use any system since the objective measurement here is qualitative difference ratio, not quantitative direct units. - with the standard pre-compiled 32-bit SQLite DLL available from the SQLite download page (https://sqlite.org/download.html)
- test DLL version: 3.17.0


Tests:

I will post the first test script in full with added annotation to explain the rationale and show the full working, but the subsequent scripts I will only post the money-shots and avoid the repetitive parts. Note that the test script is run in a transaction which is not shown explicitly and at the end rolled back to avoid differences in tests on account of the file content changing during the tests.

Test 1:

  -- SQLite version 3.17.0  [ Release: 2017-02-13 ]  on SQLitespeed version 2.0.2.4.   -- ================================================================================================

-- This first table holds Parameters for testing.
-- idxCount specifies how many entries to make in the test tables. (i.e. Number of added rows). -- rndDiv is the modality chosen for the random values to control the magnitude of random values -- so that the ratio of expected matching values to value-count can be controlled.

CREATE TABLE p (
  idxCount INT,
  rndDiv INT
);

-- Here we add the Test parameter values. This is the only section that changes between
-- the first 4 tests.
-- This test has a row-count of 100000 (One-Hundred-Thousand) and a modality of 1000 times the -- idxCount so that in all the rows, we expect roughly only about 0.1% of matches. -- Note that if every value in table 1 has a 1-in-1000 chance of having a matching valuein -- table 2, then for 100,000 rows there should be ~100 matches which is ~0.1% of the row-count.

-- This test was picked as the full example since it produces the least amount of rows.

INSERT INTO p(idxCount, rndDiv) VALUES (100000, 1000 * 100000);

-- Confirm the parameters.
SELECT * FROM p;

  --   idxCount   |    rndDiv
  -- ------------ | ------------
  --    100000    |   100000000


-- Following are the two test tables t1 and t2 each containing one column v to hold the random values.
CREATE TABLE t1 (
  v INT
);

CREATE TABLE t2 (
  v INT
);

-- Here we fill the two test tables according to our parameters above.
WITH R(idx,Val) AS (
    SELECT 0, 1000
    UNION ALL
    SELECT idx+1, abs(random() % p.rndDiv) FROM R,p WHERE idx < p.idxCount
)
INSERT INTO t1(v) SELECT Val FROM R
;

WITH R(idx,Val) AS (
    SELECT 0, 1000
    UNION ALL
    SELECT idx+1, abs(random() % p.rndDiv) FROM R,p WHERE idx < p.idxCount
)
INSERT INTO t2(v) SELECT Val FROM R
;

-- Test 1 - using JOIN.
-- Note that this is the second iteration of tests. In the first iteration I had the INTERSECT -- query before the JOIN query, but I swapped it around for the second iteration because -- the JOIN was generally faster for low match counts so I wanted to ensure neither test -- gained an advantage from being second due to some or other caching by the Query planner.

SELECT t1.v FROM t1 JOIN t2 ON t2.v = t1.v;

  --     v
  -- --------
  --     1000
  --  2312703
  -- 87720925
  -- 29736409
  -- 62527166
  -- 25171143
  -- 24168552
  -- 86449735
  -- 83544235
  -- 45671286
  -- 69343788
  -- 42394827
  -- 92142603
  -- 87106564
  --  4593574
  --  6914348
  -- 16358938
  -- 12568863
  -- 20105830
  -- 91354724
  -- 87992157
  -- 17605134
  -- 28584588
  -- 78633251
  -- 98955905
  -- 19979768
  -- 20956231
  -- 30819730
  -- 93942875
  -- 45346494
  -- 96346064
  -- 32224203
  -- 89622511
  -- 39267531
  --  3116133
  -- 31172079
  -- 87828771
  -- 82931503
  -- 89108957
  -- 80067973
  -- 89366000
  -- 68319117
  -- 37802556
  -- 64391927
  -- 84515054
  -- 11071461
  -- 40682706
  -- 78441313
  -- 17977211
  --   659811
  -- 14504321
  -- 57479870
  -- 44134958
  -- 94642155
  -- 37520503
  -- 48456547
  -- 99084161
  -- 84938198
  -- 14002200
  -- 78221942
  -- 45576636
  -- 42790976
  --  6533108
  -- 92535066
  -- 89299140
  -- 22666602
  -- 63098262
  -- 69021687
  -- 88420428
  -- 20451211
  -- 30003757
  -- 90711051
  -- 43222100
  -- 30233728
  -- 61634416
  -- 86830002
  -- 40248177
  --  1237709
  -- 58707473
  -- 39900130
  -- 92531343
  -- 78010626
  -- 67793752
  -- 29390573
  -- 23079807
  -- 80883211
  -- 15167730
  -- 15532946
  -- 84444860
  -- 89113099
  -- 70441636
  -- 47253234
  -- 49325743
  -- 55033258
  -- 90556390
  -- 86704459
  -- 45170002
  -- 47126034
  --  9496434
  -- 42163489
  -- 92938539
  -- 55105343
  -- 31110805
  -- 80062347
  -- 65114976
  --  8309007
  -- 78366148
  -- 44587619
  -- 27312379
  -- 56915827
  -- 57706349
  -- 84004545
  -- 52791919
  --  8606523
  -- 27763198
  -- 21761548
  -- 69642690

  --    Item Stats:  Item No:           8 Query Size (Chars):  45
  --                 Result Columns:    1 Result Rows:         117
  --                 VM Work Steps:     1200488 Rows Modified:       0
  --                 Full Query Time:   0d 00h 00m and 00.177s
  --                 Query Result:      Success.
  -- ------------------------------------------------------------------------------------------------


-- The INTERSECT test:

SELECT v FROM t1 INTERSECT SELECT v FROM t2;

  --       v
  -- ------------
  --       1000
  --     659811
  --    1237709
  --    2312703
  --    3116133
  --    4593574
  --    6533108
  --    6914348
  --    8309007
  --    8606523
  --    9496434
  --   11071461
  --   12568863
  --   14002200
  --   14504321
  --   15167730
  --   15532946
  --   16358938
  --   17605134
  --   17977211
  --   19979768
  --   20105830
  --   20451211
  --   20956231
  --   21761548
  --   22666602
  --   23079807
  --   24168552
  --   25171143
  --   27312379
  --   27763198
  --   28584588
  --   29390573
  --   29736409
  --   30003757
  --   30233728
  --   30819730
  --   31110805
  --   31172079
  --   32224203
  --   37520503
  --   37802556
  --   39267531
  --   39900130
  --   40248177
  --   40682706
  --   42163489
  --   42394827
  --   42790976
  --   43222100
  --   44134958
  --   44587619
  --   45170002
  --   45346494
  --   45576636
  --   45671286
  --   47126034
  --   47253234
  --   48456547
  --   49325743
  --   52791919
  --   55033258
  --   55105343
  --   56915827
  --   57479870
  --   57706349
  --   58707473
  --   61634416
  --   62527166
  --   63098262
  --   64391927
  --   65114976
  --   67793752
  --   68319117
  --   69021687
  --   69343788
  --   69642690
  --   70441636
  --   78010626
  --   78221942
  --   78366148
  --   78441313
  --   78633251
  --   80062347
  --   80067973
  --   80883211
  --   82931503
  --   83544235
  --   84004545
  --   84444860
  --   84515054
  --   84938198
  --   86449735
  --   86704459
  --   86830002
  --   87106564
  --   87720925
  --   87828771
  --   87992157
  --   88420428
  --   89108957
  --   89113099
  --   89299140
  --   89366000
  --   89622511
  --   90556390
  --   90711051
  --   91354724
  --   92142603
  --   92531343
  --   92535066
  --   92938539
  --   93942875
  --   94642155
  --   96346064
  --   98955905
  --   99084161

  --    Item Stats:  Item No:           9 Query Size (Chars):  46
  --                 Result Columns:    1 Result Rows:         117
  --                 VM Work Steps:     1100093 Rows Modified:       0
  --                 Full Query Time:   0d 00h 00m and 00.228s
  --                 Query Result:      Success.
  -- ------------------------------------------------------------------------------------------------

-- As you can see, the JOIN is roughly 30% faster than INTERSECT (0.177s vs. 0.228s) for this test. -- Also note that the Virtual-Machine work steps needed to accomplish the result differed -- between the two tests (1.2 million vs. 1.1 million) which is very interesting because the -- JOIN used ~9% more VM steps to complete (i.e. less efficient), but did so in a faster time.

-- Another interesting thing to note: The INTERSECT test produces ORDERED output, which -- suggests that an ORDER-BY addition to the query would favour the INTERSECT method.
-- (This led to Test 6).

-- This is the query plan for the INTERSECT query:

EXPLAIN QUERY PLAN SELECT v FROM t1 INTERSECT SELECT v FROM t2;

  -- selectid | order | from | detail
  -- -------- | ----- | ---- | ---------------------------------------------------------
  --     1    |   0   |   0  | SCAN TABLE t1
  --     2    |   0   |   0  | SCAN TABLE t2
  --     0    |   0   |   0  | COMPOUND SUBQUERIES 1 AND 2 USING TEMP B-TREE (INTERSECT)


-- And the plan for the JOIN query:

EXPLAIN QUERY PLAN SELECT t1.v FROM t1 JOIN t2 ON t2.v = t1.v;

  -- selectid | order | from | detail
  -- -------- | ----- | ---- | ----------------------------------------------------
  --     0    |   0   |   0  | SCAN TABLE t1
  --     0    |   1   |   1  | SEARCH TABLE t2 USING AUTOMATIC COVERING INDEX (v=?)

-- Cleanup.
DROP TABLE t1;

DROP TABLE t2;

  --   Script Stats: Total Script Execution Time:     0d 00h 00m and 00.718s   --                 Total Script Query Time:         0d 00h 00m and 00.683s
  --                 Total Database Rows Changed: 200003
  --                 Total Virtual-Machine Steps: 8501075
  --                 Last executed Item Index:        13
  --                 Last Script Error:
  -- ------------------------------------------------------------------------------------------------


-- Note form the log below that both queries created an automatic index (unsurprisingly as suggested by the Query Plans above).

  -- 2017-09-06 16:10:46.021  |  [Success]    Script Success.
  -- 2017-09-06 16:10:46.066  |  [Success]    Transaction Rolled back.
  -- -------  DB-Engine Logs (Contains logged information from all DB connections during run)  ------   -- [2017-09-06 16:10:45.286] APPLICATION : Script E:\Documents\SQLiteAutoScript.sql started at 16:10:45.286 on 06 September.
  -- [2017-09-06 16:10:45.604] ERROR (284) : automatic index on t2(v)
  -- [2017-09-06 16:10:46.014] ERROR (284) : automatic index on t2(v)
  -- ================================================================================================


Test 2: Using idxCount of 100,000 rows and ~1% possible matches (rndDiv now 100 x idxCount)


INSERT INTO p(idxCount, rndDiv) VALUES (100000, 100*100000);

SELECT t1.v FROM t1 JOIN t2 ON t2.v = t1.v;

  --    v
  -- -------
  --   1000
  -- 2251245
  -- 6690293
  -- 8754086
  -- 6871656
...
...
  -- 3230498
  -- 6398647
  -- 5083104

  --    Item Stats:  Item No:           8 Query Size (Chars):  45
  --                 Result Columns:    1 Result Rows:         997
  --                 VM Work Steps:     1204006 Rows Modified:       0
  --                 Full Query Time:   0d 00h 00m and 00.187s
  --                 Query Result:      Success.
  -- ------------------------------------------------------------------------------------------------

SELECT v FROM t1 INTERSECT SELECT v FROM t2;

  --       v
  -- ------------
  --     1000
  --     2058
  --     12265
...
...
  --    9980854
  --    9985094

  --    Item Stats:  Item No:           9 Query Size (Chars):  46
  --                 Result Columns:    1 Result Rows:         986
  --                 VM Work Steps:     1100517 Rows Modified:       0
  --                 Full Query Time:   0d 00h 00m and 00.353s
  --                 Query Result:      Success.
  -- ------------------------------------------------------------------------------------------------

Test 2 Notes: VM steps almost identical to test 1. JOIN is now near twice as fast. Note also that JOIN produced 11 rows more than INTERSECT which suggests we had 11 duplicate values in the tables.



Test 3: Using idxCount of 100,000 rows and ~10% possible matches (rndDiv now 10 x idxCount)

INSERT INTO p(idxCount,rndDiv) VALUES (100000, 10*100000);

SELECT t1.v FROM t1 JOIN t2 ON t2.v = t1.v;

  --    v
  -- ------
  --   1000
  -- 894336
  -- 296578
...
...
  -- 795418
  --  66041
  -- 800682

  --    Item Stats:  Item No:           8 Query Size (Chars):  45
  --                 Result Columns:    1 Result Rows:         9993
  --                 VM Work Steps:     1239992 Rows Modified:       0
  --                 Full Query Time:   0d 00h 00m and 00.267s
  --                 Query Result:      Success.
  -- ------------------------------------------------------------------------------------------------


SELECT v FROM t1 INTERSECT SELECT v FROM t2;

  --       v
  -- ------------
  --       165
  --       301
  --       353
...
...
  --    999724
  --    999864
  --    999999

  --    Item Stats:  Item No:           9 Query Size (Chars):  46
  --                 Result Columns:    1 Result Rows:         9042
  --                 VM Work Steps:     1103714 Rows Modified:       0
  --                 Full Query Time:   0d 00h 00m and 00.285s
  --                 Query Result:      Success.
  -- ------------------------------------------------------------------------------------------------

Test 3 Notes: Again similar VM steps though it seems the join gained most added VM steps. JOIN and INTERSECT now performing very near equal.

Test 4: Using idxCount of 100,000 rows and ~100% possible matches (rndDiv now 1 x idxCount)

INSERT INTO p(idxCount,rndDiv) VALUES (100000, 1*100000);


SELECT t1.v FROM t1 JOIN t2 ON t2.v = t1.v;

  --    v
  -- ------
  --   1000
  -- 297764
  -- 386572
...
...
  --  18950
  --  24377

  --    Item Stats:  Item No:           8 Query Size (Chars):  45
  --                 Result Columns:    1 Result Rows:         99479
  --                 VM Work Steps:     1597937 Rows Modified:       0
  --                 Full Query Time:   0d 00h 00m and 00.812s
  --                 Query Result:      Success.
  -- ------------------------------------------------------------------------------------------------

SELECT v FROM t1 INTERSECT SELECT v FROM t2;

  --   v
  -- -----
  --     0
  --     2
  --     5
  --     7
  --     8
  --    14
...
...
  -- 99992
  -- 99996
  -- 99998

  --    Item Stats:  Item No:           9 Query Size (Chars):  46
  --                 Result Columns:    1 Result Rows:         40027
  --                 VM Work Steps:     1069969 Rows Modified:       0
  --                 Full Query Time:   0d 00h 00m and 00.567s
  --                 Query Result:      Success.
  -- ------------------------------------------------------------------------------------------------

Test 4 Notes: JOIN is using now more VM steps and INTERSECT is using even less, and INTERSECT now clearly the faster of the two, not to mention it still has ORDERED output while JOIN has not.

However...

In these higher match-count tests, the possibility of duplicate entries in both tables t1 and t2 also increased, to the point where there is a large percentage of duplicates in test 4 which would have produced more rows and slower execution from the JOIN query than the INTERSECT query, especially in cases where the same value was duplicate in both tables. This almost disqualifies Test 4 as an objective measure.

To better test this in the next tests, I changed the test script to only allow rows in table t1 and t2 that have no duplicates (made v a PRIMARY KEY and used INSERT OR IGNORE when populating the tables):

CREATE TABLE t1 (
  v INT PRIMARY KEY
);

CREATE TABLE t2 (
  v INT PRIMARY KEY
);

WITH R(idx,Val) AS (
    SELECT 0, 1000
    UNION ALL
    SELECT idx+1, abs(random() % p.rndDiv) FROM R,p WHERE idx < p.idxCount
)
INSERT OR IGNORE INTO t1(v) SELECT Val FROM R
;

WITH R(idx,Val) AS (
    SELECT 0, 1000
    UNION ALL
    SELECT idx+1, abs(random() % p.rndDiv) FROM R,p WHERE idx < p.idxCount
)
INSERT OR IGNORE INTO t2(v) SELECT Val FROM R
;

Test 5: Using idxCount of 100,000 rows and ~100% possible matches (rndDiv now 1 x idxCount) - Same as test 4 but without duplicates.

SELECT t1.v FROM t1 JOIN t2 ON t2.v = t1.v;

  --    Item Stats:  Item No:           8             Query Size (Chars):  45
  --                 Result Columns:    1 Result Rows:         40099
  --                 VM Work Steps:     476303 Rows Modified:       0
  --                 Full Query Time:   0d 00h 00m and 00.411s
  --                 Query Result:      Success.
  -- ------------------------------------------------------------------------------------------------

SELECT t1.v FROM t1 INTERSECT SELECT t2.v FROM t2;

  --    Item Stats:  Item No:           9             Query Size (Chars):  52
  --                 Result Columns:    1 Result Rows:         40099
  --                 VM Work Steps:     775787 Rows Modified:       0
  --                 Full Query Time:   0d 00h 00m and 00.553s
  --                 Query Result:      Success.
  -- ------------------------------------------------------------------------------------------------

Test 5 Notes: The no-duplicates test script favoured JOIN all the way to 100% matches (for both speed and efficiency, but with un-ordered results for JOIN).


Test 6: Using idxCount of 100,000 rows and ~100% possible matches (rndDiv now 1 x idxCount) - Same as Test 5 but now explicitly ordered.

SELECT t1.v FROM t1 JOIN t2 ON t2.v = t1.v ORDER BY t1.v;

  --    Item Stats:  Item No:           8             Query Size (Chars):  59
  --                 Result Columns:    1 Result Rows:         39906
  --                 VM Work Steps:     475396 Rows Modified:       0
  --                 Full Query Time:   0d 00h 00m and 00.281s
  --                 Query Result:      Success.
  -- ------------------------------------------------------------------------------------------------

SELECT t1.v FROM t1 INTERSECT SELECT t2.v FROM t2 ORDER BY t1.v;

  --    Item Stats:  Item No:           9             Query Size (Chars):  66
  --                 Result Columns:    1 Result Rows:         39906
  --                 VM Work Steps:     1330447 Rows Modified:       0
  --                 Full Query Time:   0d 00h 00m and 00.263s
  --                 Query Result:      Success.
  -- ------------------------------------------------------------------------------------------------

Test 6 Notes: Changing again to test for ordered results, INTERSECT was indeed faster again, even for queries with no duplicates - but a much bigger surprise here is that BOTH queries are significantly faster than their un-ordered versions even though much less efficient in the INTERSECT case (almost twice the VM steps but executing in half the time)!


*Deductions*

The following could be inferred from the tests:
-  I did not show the initial set-up tests, but it determined that the table-size directly correlates with the measured differences in efficiency - i.e. it is not a factor, so I picked a table-size (100K rows) that was a good work-out for the Query engine but that ran relatively quick in human terms. -  JOIN wins convincingly for a low number of matches, but as the number of matching values increase, INTERSECT gets more efficient and eventually wins in situations where duplicate values occur. -  JOIN's prowess is not linear, it does better (in ratio to INTERSECT) at 1% matches than at 0.1% matches, and remain superior as long as no duplicate values occur.
-  The output from INTERSECT is ordered.
-  Output for both JOIN and INTERSECT is faster if you use explicit ordering on the Joined/Intesected column(s) - This surprised me the most.


*Conclusion*: To be most efficient: If you expect hitting more than 10% matches out of the total rows in the tables and they contain possible duplicate values or you need the output ordered - use INTERSECT. For all other cases, use JOIN, and either way, use ORDER BY on the Joined/Intersected column to make things faster still.


Cheers,
Ryan


_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

Reply via email to