[sqlite] JOIN vs. INTERSECT speed and efficiency

R Smith Wed, 06 Sep 2017 11:07:09 -0700

Hi all,

For those interested, after a recent thread from a poster called Joeasking about the most efficient way to find values that coincide fromtwo separate tables, a response from Clemens Ladisch and a furtherelaboration from myself suggested the following:


SELECT v FROM t1 INTERSECT SELECT v FROM t2

I was going to mention that that INTERSECT is simply equivalent to aJOIN on v (if there are no duplicate entries) of the form:


SELECT v FROM t1 JOIN t2 ON t2.v = t1.v

But then the question arose: Which is more efficient? - So at the timeof posting I refrained from mentioning it, but it kept in my mind, untilI decided to try and answer the question.


Next step was to set up an experiment to mimic the above and test it.

*Experiment* - If the reader is interested, or would like to compare andcontrast with possibly different DB engines, versions, platforms orsettings, read on, else just skip ahead to the Conclusion bit.


Setup:
- Using SQLite.

- Ran on my slowest (to magnify efficiency difference) notebook (Win1064bit)- using sqlitespeed (http://www.sqlc.rifin.co.za/) but you can use anysystem since the objective measurement here is qualitative differenceratio, not quantitative direct units.- with the standard pre-compiled 32-bit SQLite DLL available from theSQLite download page (https://sqlite.org/download.html)

- test DLL version: 3.17.0


Tests:

I will post the first test script in full with added annotation toexplain the rationale and show the full working, but the subsequentscripts I will only post the money-shots and avoid the repetitive parts.Note that the test script is run in a transaction which is not shownexplicitly and at the end rolled back to avoid differences in tests onaccount of the file content changing during the tests.


Test 1:

-- SQLite version 3.17.0 [ Release: 2017-02-13 ] on SQLitespeedversion 2.0.2.4. --================================================================================================


-- This first table holds Parameters for testing.

-- idxCount specifies how many entries to make in the test tables. (i.e.Number of added rows).-- rndDiv is the modality chosen for the random values to control themagnitude of random values-- so that the ratio of expected matching values to value-count can becontrolled.


CREATE TABLE p (
  idxCount INT,
  rndDiv INT
);

-- Here we add the Test parameter values. This is the only section thatchanges between

-- the first 4 tests.

-- This test has a row-count of 100000 (One-Hundred-Thousand) and amodality of 1000 times the-- idxCount so that in all the rows, we expect roughly only about 0.1%of matches.-- Note that if every value in table 1 has a 1-in-1000 chance of havinga matching valuein-- table 2, then for 100,000 rows there should be ~100 matches which is~0.1% of the row-count.

-- This test was picked as the full example since it produces the leastamount of rows.


INSERT INTO p(idxCount, rndDiv) VALUES (100000, 1000 * 100000);

-- Confirm the parameters.
SELECT * FROM p;

  --   idxCount   |    rndDiv
  -- ------------ | ------------
  --    100000    |   100000000

-- Following are the two test tables t1 and t2 each containing onecolumn v to hold the random values.

CREATE TABLE t1 (
  v INT
);

CREATE TABLE t2 (
  v INT
);

-- Here we fill the two test tables according to our parameters above.
WITH R(idx,Val) AS (
    SELECT 0, 1000
    UNION ALL
    SELECT idx+1, abs(random() % p.rndDiv) FROM R,p WHERE idx < p.idxCount
)
INSERT INTO t1(v) SELECT Val FROM R
;

WITH R(idx,Val) AS (
    SELECT 0, 1000
    UNION ALL
    SELECT idx+1, abs(random() % p.rndDiv) FROM R,p WHERE idx < p.idxCount
)
INSERT INTO t2(v) SELECT Val FROM R
;

-- Test 1 - using JOIN.

-- Note that this is the second iteration of tests. In the firstiteration I had the INTERSECT-- query before the JOIN query, but I swapped it around for the seconditeration because-- the JOIN was generally faster for low match counts so I wanted toensure neither test-- gained an advantage from being second due to some or other caching bythe Query planner.


SELECT t1.v FROM t1 JOIN t2 ON t2.v = t1.v;

  --     v
  -- --------
  --     1000
  --  2312703
  -- 87720925
  -- 29736409
  -- 62527166
  -- 25171143
  -- 24168552
  -- 86449735
  -- 83544235
  -- 45671286
  -- 69343788
  -- 42394827
  -- 92142603
  -- 87106564
  --  4593574
  --  6914348
  -- 16358938
  -- 12568863
  -- 20105830
  -- 91354724
  -- 87992157
  -- 17605134
  -- 28584588
  -- 78633251
  -- 98955905
  -- 19979768
  -- 20956231
  -- 30819730
  -- 93942875
  -- 45346494
  -- 96346064
  -- 32224203
  -- 89622511
  -- 39267531
  --  3116133
  -- 31172079
  -- 87828771
  -- 82931503
  -- 89108957
  -- 80067973
  -- 89366000
  -- 68319117
  -- 37802556
  -- 64391927
  -- 84515054
  -- 11071461
  -- 40682706
  -- 78441313
  -- 17977211
  --   659811
  -- 14504321
  -- 57479870
  -- 44134958
  -- 94642155
  -- 37520503
  -- 48456547
  -- 99084161
  -- 84938198
  -- 14002200
  -- 78221942
  -- 45576636
  -- 42790976
  --  6533108
  -- 92535066
  -- 89299140
  -- 22666602
  -- 63098262
  -- 69021687
  -- 88420428
  -- 20451211
  -- 30003757
  -- 90711051
  -- 43222100
  -- 30233728
  -- 61634416
  -- 86830002
  -- 40248177
  --  1237709
  -- 58707473
  -- 39900130
  -- 92531343
  -- 78010626
  -- 67793752
  -- 29390573
  -- 23079807
  -- 80883211
  -- 15167730
  -- 15532946
  -- 84444860
  -- 89113099
  -- 70441636
  -- 47253234
  -- 49325743
  -- 55033258
  -- 90556390
  -- 86704459
  -- 45170002
  -- 47126034
  --  9496434
  -- 42163489
  -- 92938539
  -- 55105343
  -- 31110805
  -- 80062347
  -- 65114976
  --  8309007
  -- 78366148
  -- 44587619
  -- 27312379
  -- 56915827
  -- 57706349
  -- 84004545
  -- 52791919
  --  8606523
  -- 27763198
  -- 21761548
  -- 69642690

  --    Item Stats:  Item No:           8 Query Size (Chars):  45
  --                 Result Columns:    1 Result Rows:         117
  --                 VM Work Steps:     1200488 Rows Modified:       0
  --                 Full Query Time:   0d 00h 00m and 00.177s
  --                 Query Result:      Success.

--------------------------------------------------------------------------------------------------



-- The INTERSECT test:

SELECT v FROM t1 INTERSECT SELECT v FROM t2;

  --       v
  -- ------------
  --       1000
  --     659811
  --    1237709
  --    2312703
  --    3116133
  --    4593574
  --    6533108
  --    6914348
  --    8309007
  --    8606523
  --    9496434
  --   11071461
  --   12568863
  --   14002200
  --   14504321
  --   15167730
  --   15532946
  --   16358938
  --   17605134
  --   17977211
  --   19979768
  --   20105830
  --   20451211
  --   20956231
  --   21761548
  --   22666602
  --   23079807
  --   24168552
  --   25171143
  --   27312379
  --   27763198
  --   28584588
  --   29390573
  --   29736409
  --   30003757
  --   30233728
  --   30819730
  --   31110805
  --   31172079
  --   32224203
  --   37520503
  --   37802556
  --   39267531
  --   39900130
  --   40248177
  --   40682706
  --   42163489
  --   42394827
  --   42790976
  --   43222100
  --   44134958
  --   44587619
  --   45170002
  --   45346494
  --   45576636
  --   45671286
  --   47126034
  --   47253234
  --   48456547
  --   49325743
  --   52791919
  --   55033258
  --   55105343
  --   56915827
  --   57479870
  --   57706349
  --   58707473
  --   61634416
  --   62527166
  --   63098262
  --   64391927
  --   65114976
  --   67793752
  --   68319117
  --   69021687
  --   69343788
  --   69642690
  --   70441636
  --   78010626
  --   78221942
  --   78366148
  --   78441313
  --   78633251
  --   80062347
  --   80067973
  --   80883211
  --   82931503
  --   83544235
  --   84004545
  --   84444860
  --   84515054
  --   84938198
  --   86449735
  --   86704459
  --   86830002
  --   87106564
  --   87720925
  --   87828771
  --   87992157
  --   88420428
  --   89108957
  --   89113099
  --   89299140
  --   89366000
  --   89622511
  --   90556390
  --   90711051
  --   91354724
  --   92142603
  --   92531343
  --   92535066
  --   92938539
  --   93942875
  --   94642155
  --   96346064
  --   98955905
  --   99084161

  --    Item Stats:  Item No:           9 Query Size (Chars):  46
  --                 Result Columns:    1 Result Rows:         117
  --                 VM Work Steps:     1100093 Rows Modified:       0
  --                 Full Query Time:   0d 00h 00m and 00.228s
  --                 Query Result:      Success.

--------------------------------------------------------------------------------------------------

-- As you can see, the JOIN is roughly 30% faster than INTERSECT (0.177svs. 0.228s) for this test.-- Also note that the Virtual-Machine work steps needed to accomplishthe result differed-- between the two tests (1.2 million vs. 1.1 million) which is veryinteresting because the-- JOIN used ~9% more VM steps to complete (i.e. less efficient), butdid so in a faster time.

-- Another interesting thing to note: The INTERSECT test producesORDERED output, which-- suggests that an ORDER-BY addition to the query would favour theINTERSECT method.

-- (This led to Test 6).

-- This is the query plan for the INTERSECT query:

EXPLAIN QUERY PLAN SELECT v FROM t1 INTERSECT SELECT v FROM t2;

  -- selectid | order | from | detail

-- -------- | ----- | ---- |---------------------------------------------------------

  --     1    |   0   |   0  | SCAN TABLE t1
  --     2    |   0   |   0  | SCAN TABLE t2

-- 0 | 0 | 0 | COMPOUND SUBQUERIES 1 AND 2 USING TEMPB-TREE (INTERSECT)



-- And the plan for the JOIN query:

EXPLAIN QUERY PLAN SELECT t1.v FROM t1 JOIN t2 ON t2.v = t1.v;

  -- selectid | order | from | detail

-- -------- | ----- | ---- |----------------------------------------------------

  --     0    |   0   |   0  | SCAN TABLE t1

-- 0 | 1 | 1 | SEARCH TABLE t2 USING AUTOMATIC COVERINGINDEX (v=?)


-- Cleanup.
DROP TABLE t1;

DROP TABLE t2;

-- Script Stats: Total Script Execution Time: 0d 00h 00m and00.718s -- Total Script Query Time: 0d 00h 00m and00.683s

  --                 Total Database Rows Changed: 200003
  --                 Total Virtual-Machine Steps: 8501075
  --                 Last executed Item Index:        13
  --                 Last Script Error:

--------------------------------------------------------------------------------------------------

-- Note form the log below that both queries created an automatic index(unsurprisingly as suggested by the Query Plans above).


  -- 2017-09-06 16:10:46.021  |  [Success]    Script Success.
  -- 2017-09-06 16:10:46.066  |  [Success]    Transaction Rolled back.

-- ------- DB-Engine Logs (Contains logged information from all DBconnections during run) ------ -- [2017-09-06 16:10:45.286] APPLICATION : ScriptE:\Documents\SQLiteAutoScript.sql started at 16:10:45.286 on 06 September.

  -- [2017-09-06 16:10:45.604] ERROR (284) : automatic index on t2(v)
  -- [2017-09-06 16:10:46.014] ERROR (284) : automatic index on t2(v)

--================================================================================================

Test 2: Using idxCount of 100,000 rows and ~1% possible matches (rndDivnow 100 x idxCount)



INSERT INTO p(idxCount, rndDiv) VALUES (100000, 100*100000);

SELECT t1.v FROM t1 JOIN t2 ON t2.v = t1.v;

  --    v
  -- -------
  --   1000
  -- 2251245
  -- 6690293
  -- 8754086
  -- 6871656
...
...
  -- 3230498
  -- 6398647
  -- 5083104

  --    Item Stats:  Item No:           8 Query Size (Chars):  45
  --                 Result Columns:    1 Result Rows:         997
  --                 VM Work Steps:     1204006 Rows Modified:       0
  --                 Full Query Time:   0d 00h 00m and 00.187s
  --                 Query Result:      Success.

--------------------------------------------------------------------------------------------------


SELECT v FROM t1 INTERSECT SELECT v FROM t2;

  --       v
  -- ------------
  --     1000
  --     2058
  --     12265
...
...
  --    9980854
  --    9985094

  --    Item Stats:  Item No:           9 Query Size (Chars):  46
  --                 Result Columns:    1 Result Rows:         986
  --                 VM Work Steps:     1100517 Rows Modified:       0
  --                 Full Query Time:   0d 00h 00m and 00.353s
  --                 Query Result:      Success.

--------------------------------------------------------------------------------------------------

Test 2 Notes: VM steps almost identical to test 1. JOIN is now neartwice as fast.Note also that JOIN produced 11 rows more than INTERSECT which suggestswe had 11 duplicate values in the tables.

Test 3: Using idxCount of 100,000 rows and ~10% possible matches (rndDivnow 10 x idxCount)


INSERT INTO p(idxCount,rndDiv) VALUES (100000, 10*100000);

SELECT t1.v FROM t1 JOIN t2 ON t2.v = t1.v;

  --    v
  -- ------
  --   1000
  -- 894336
  -- 296578
...
...
  -- 795418
  --  66041
  -- 800682

  --    Item Stats:  Item No:           8 Query Size (Chars):  45
  --                 Result Columns:    1 Result Rows:         9993
  --                 VM Work Steps:     1239992 Rows Modified:       0
  --                 Full Query Time:   0d 00h 00m and 00.267s
  --                 Query Result:      Success.

--------------------------------------------------------------------------------------------------



SELECT v FROM t1 INTERSECT SELECT v FROM t2;

  --       v
  -- ------------
  --       165
  --       301
  --       353
...
...
  --    999724
  --    999864
  --    999999

  --    Item Stats:  Item No:           9 Query Size (Chars):  46
  --                 Result Columns:    1 Result Rows:         9042
  --                 VM Work Steps:     1103714 Rows Modified:       0
  --                 Full Query Time:   0d 00h 00m and 00.285s
  --                 Query Result:      Success.

--------------------------------------------------------------------------------------------------

Test 3 Notes: Again similar VM steps though it seems the join gainedmost added VM steps. JOIN and INTERSECT now performing very near equal.

Test 4: Using idxCount of 100,000 rows and ~100% possible matches(rndDiv now 1 x idxCount)


INSERT INTO p(idxCount,rndDiv) VALUES (100000, 1*100000);


SELECT t1.v FROM t1 JOIN t2 ON t2.v = t1.v;

  --    v
  -- ------
  --   1000
  -- 297764
  -- 386572
...
...
  --  18950
  --  24377

  --    Item Stats:  Item No:           8 Query Size (Chars):  45
  --                 Result Columns:    1 Result Rows:         99479
  --                 VM Work Steps:     1597937 Rows Modified:       0
  --                 Full Query Time:   0d 00h 00m and 00.812s
  --                 Query Result:      Success.

--------------------------------------------------------------------------------------------------


SELECT v FROM t1 INTERSECT SELECT v FROM t2;

  --   v
  -- -----
  --     0
  --     2
  --     5
  --     7
  --     8
  --    14
...
...
  -- 99992
  -- 99996
  -- 99998

  --    Item Stats:  Item No:           9 Query Size (Chars):  46
  --                 Result Columns:    1 Result Rows:         40027
  --                 VM Work Steps:     1069969 Rows Modified:       0
  --                 Full Query Time:   0d 00h 00m and 00.567s
  --                 Query Result:      Success.

--------------------------------------------------------------------------------------------------

Test 4 Notes: JOIN is using now more VM steps and INTERSECT is usingeven less, and INTERSECT now clearly the faster of the two, not tomention it still has ORDERED output while JOIN has not.


However...

In these higher match-count tests, the possibility of duplicate entriesin both tables t1 and t2 also increased, to the point where there is alarge percentage of duplicates in test 4 which would have produced morerows and slower execution from the JOIN query than the INTERSECT query,especially in cases where the same value was duplicate in both tables.This almost disqualifies Test 4 as an objective measure.

To better test this in the next tests, I changed the test script to onlyallow rows in table t1 and t2 that have no duplicates (made v a PRIMARYKEY and used INSERT OR IGNORE when populating the tables):


CREATE TABLE t1 (
  v INT PRIMARY KEY
);

CREATE TABLE t2 (
  v INT PRIMARY KEY
);

WITH R(idx,Val) AS (
    SELECT 0, 1000
    UNION ALL
    SELECT idx+1, abs(random() % p.rndDiv) FROM R,p WHERE idx < p.idxCount
)
INSERT OR IGNORE INTO t1(v) SELECT Val FROM R
;

WITH R(idx,Val) AS (
    SELECT 0, 1000
    UNION ALL
    SELECT idx+1, abs(random() % p.rndDiv) FROM R,p WHERE idx < p.idxCount
)
INSERT OR IGNORE INTO t2(v) SELECT Val FROM R
;

Test 5: Using idxCount of 100,000 rows and ~100% possible matches(rndDiv now 1 x idxCount) - Same as test 4 but without duplicates.


SELECT t1.v FROM t1 JOIN t2 ON t2.v = t1.v;

-- Item Stats: Item No: 8 Query Size(Chars): 45

  --                 Result Columns:    1 Result Rows:         40099
  --                 VM Work Steps:     476303 Rows Modified:       0
  --                 Full Query Time:   0d 00h 00m and 00.411s
  --                 Query Result:      Success.

--------------------------------------------------------------------------------------------------


SELECT t1.v FROM t1 INTERSECT SELECT t2.v FROM t2;

-- Item Stats: Item No: 9 Query Size(Chars): 52

  --                 Result Columns:    1 Result Rows:         40099
  --                 VM Work Steps:     775787 Rows Modified:       0
  --                 Full Query Time:   0d 00h 00m and 00.553s
  --                 Query Result:      Success.

--------------------------------------------------------------------------------------------------

Test 5 Notes: The no-duplicates test script favoured JOIN all the way to100% matches (for both speed and efficiency, but with un-ordered resultsfor JOIN).

Test 6: Using idxCount of 100,000 rows and ~100% possible matches(rndDiv now 1 x idxCount) - Same as Test 5 but now explicitly ordered.


SELECT t1.v FROM t1 JOIN t2 ON t2.v = t1.v ORDER BY t1.v;

-- Item Stats: Item No: 8 Query Size(Chars): 59

  --                 Result Columns:    1 Result Rows:         39906
  --                 VM Work Steps:     475396 Rows Modified:       0
  --                 Full Query Time:   0d 00h 00m and 00.281s
  --                 Query Result:      Success.

--------------------------------------------------------------------------------------------------


SELECT t1.v FROM t1 INTERSECT SELECT t2.v FROM t2 ORDER BY t1.v;

-- Item Stats: Item No: 9 Query Size(Chars): 66

  --                 Result Columns:    1 Result Rows:         39906
  --                 VM Work Steps:     1330447 Rows Modified:       0
  --                 Full Query Time:   0d 00h 00m and 00.263s
  --                 Query Result:      Success.

--------------------------------------------------------------------------------------------------

Test 6 Notes: Changing again to test for ordered results, INTERSECT wasindeed faster again, even for queries with no duplicates - but a muchbigger surprise here is that BOTH queries are significantly faster thantheir un-ordered versions even though much less efficient in theINTERSECT case (almost twice the VM steps but executing in half the time)!



*Deductions*

The following could be inferred from the tests:

- I did not show the initial set-up tests, but it determined that thetable-size directly correlates with the measured differences inefficiency - i.e. it is not a factor, so I picked a table-size (100Krows) that was a good work-out for the Query engine but that ranrelatively quick in human terms.- JOIN wins convincingly for a low number of matches, but as the numberof matching values increase, INTERSECT gets more efficient andeventually wins in situations where duplicate values occur.- JOIN's prowess is not linear, it does better (in ratio to INTERSECT)at 1% matches than at 0.1% matches, and remain superior as long as noduplicate values occur.

-  The output from INTERSECT is ordered.

- Output for both JOIN and INTERSECT is faster if you use explicitordering on the Joined/Intesected column(s) - This surprised me the most.

*Conclusion*: To be most efficient: If you expect hitting more than 10%matches out of the total rows in the tables and they contain possibleduplicate values or you need the output ordered - use INTERSECT. For allother cases, use JOIN, and either way, use ORDER BY on theJoined/Intersected column to make things faster still.



Cheers,
Ryan


_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

[sqlite] JOIN vs. INTERSECT speed and efficiency

Reply via email to