[sqlite] JOIN vs. INTERSECT vs. WHERE IN (...) - speed and efficiency differences

R Smith Sat, 09 Sep 2017 08:56:59 -0700

Full tests completed with findings ranging from less interesting toexposing a rather significant inefficiency in SQLite.

I won't post all the tests because that would take far too much space,in stead I will simply discuss the experiment and findings and post thetest script so that anyone can verify the findings.



*Problem*:

Earlier tests suggested that, in a query where the common rows betweentwo tables need to be found, JOIN is faster than INTERSECT when thereare very few matches with efficiencies evening out as the number ofmatches increase. It suggested, more pertinently, that an ORDER BYclause when added to both types of queries improved all-roundperformance by a significant margin (moreso in JOIN's case).After examination of the tests, comments suggested that a lot of theefficiency differences may be due to:

- row_id involvement,

- checks only done on Primary key values as opposed to common rowvalues, and- test tables having one or two columns only not matching any real-worldscenario which made the findings irrelevant.It was also pointed out that apart from JOIN and INTERSECT, a WHERE IN(SELECT FROM ...) clause could achieve the same query result set andshould form part of the tests.



*Purpose*:

This experiment and tests aimed to establish the differences inefficiency for queries with the same results that could be posed indifferent ways to the Query Planner in specifically SQLite.The possible ways to ask the same question that was tested includedsimilar queries (producing the same result set) when finding the commonset of rows between two tables using JOIN, INTERSECT and WHERE IN (...),and secondly similar queries finding the exclusive set of rows betweentwo tables using OUTER JOIN, EXCEPT and WHERE NOT IN (...).[SQLite doesn't implement a full outer join so that specific test wasrestricted to comparing EXCEPT vs. WHERE NOT IN (...)]



*Method*:

A simple SQL script that first establishes a table "p" with parametersthat dictate the data shape added to the tests. Next a random valuetable "rnd" is set up with the number of rows and randomness scopedictated by table "p" and then two test tables "t1" and "t2" with each151 columns (1 Key and 30 each of 5 data types - INT, REAL, NUMERIC(Dates), BLOB and TEXT) which gets populated with randomly selected rowsfrom the "rnd" table so that many rows appear only in either table, andsome rows appear in both (coincidence).

The total number of rows to choose from is dictated by the "p" table's"rndCount" field and the approximate percentage of rows that appear inboth tables is dictated by the "rndAddFreq" (random row addingfrequency) field.Up to here the tests are all similar, except that for the EXCEPT / JOINWHERE NOT IN (...) a much higher frequency of coincidence between thetable rows were selected since the output count is opposite to the othertests.The queries designed to produce the exact same output using thedifferent methods discussed above are run in turn. The queries are thenrepeated with an added ORDER BY clause.

Timings are measured for all.

*Test parameters* to ensure all potential problems mentioned above wereaccounted for:- Pre-setup of the random table ensures all the queries operate on theexact same randomized data.- Tables contain 150 data columns of all typical use types and withvarying data and data lengths (in the case of BLOB and TEXT at least)

- Tables have a Primary Key, but it does not alias the Row_ID
- Tests were repeated for both normal and WITHOUT ROWID tables

- Query field comparisons involved a full set of field types (exceptBLOB columns) and not the primary key- Test-bed using older/slower notebook and sqlitespeed on Windows 1064bit prof with updated latest sqlite3 DLL file - version 3.20.1 Release2017-08-24.



*General Notes*:

- Queries were tested in different orders to avoid caching effects, butthis made no difference - it seems there is no mystery caching when thequeries are so different.- The EXCEPT vs. WHERE NOT IN (...) queries yielded consistently similartimes through a range of parameters, with or without ORDER BY clauses,with or without ROWIDs, so their tests provided no real insight, isprobably both well-optimized already, and I won't refer to them again inthis document.

*Findings* pertaining to JOIN, INTERSECT and WHERE IN (...) type querieson tables *with Row_ids*:First important thing to note is that when not involving the primary keyand using these larger (wider) tables, the ORDER BY clause addition madevery little difference (sometimes completely zero difference). JOIN andINTERSECT posted similar times for these larger (wider) tables withINTERSECT consistently winning by a very slight margin.The surprise here was that WHERE IN (...) queries were consistentlyfaster than both JOIN and INTERSECT (again with the added ORDER BY beinginconsequential). Not by a large margin though, 4-5% on average.

*RESULT*: No clear opportunity for optimization exists for the above case.

*Findings* pertaining to JOIN, INTERSECT and WHERE IN (...) type querieson tables *WITHOUT Row_ids*:

(This is the full test posted below because it is the one that matters most)

INTERSECT AND WHERE IN (...) queries posted similar times here withWHERE IN (...) being the slightly faster (similar to the abovefindings), but BOTH were a good 10 - 15% faster than on the ROW-IDtables, so WITHOUT ROWID tables seem to have a definite advantage here(it is currently unclear to me why this is).

A troubling test is the JOIN on WITHOUT ROWID tables - it took severalorders of magnitude longer than any other test in the entire experiment.Typical test times posted in the order of between 10 to 200milliseconds. The JOIN query here took over a minute on this test bed -that /must/ be a glaring mistake in the Query planner or match methodsomewhere.

*RESULT*: No clear opportunity for optimization exists for INTERSECT andWHERE IN (...) queries above, but the JOIN query has perhaps a QPdeficiency that needs fixing or a Query execution optimization that ispossible.



*Test Script*:

Note that you can increase / decrease the number of random generatedrows by adjusting the p.rndCount, and adjust the frequency of coincidingrows (2..18 where 2 is highest and 18 lowest coincidence) with thep.rndAddFreq values inserted into the p table. I found 100,000 rows and18 frequency produces test tables with roughly 1000 coinciding hits forthe queries as a good measure.

You will want to place it inside a transaction maybe, turn timers on andpipe the output to a file, it can get quite lengthy.

This posted version of the test script is precisely the one thathighlights the JOIN inefficiency mentioned above. On the test bed used,it posted times in the order of 50-60 milliseconds for the INTERCEPT andWHERE IN (...) queries, and /over a minute/ for the JOIN query. If yousimply remove the "WITHOUT ROWID" specifiers form the two test tables,this entire script will run in under 3 seconds typically. Add the"WITHOUT ROWID" (as below) and the script takes minutes. (I also testedwith older 3.17.0 DLL, it was even slower, but not significantly so).



-- Test Paremeters Table
CREATE TABLE p (
  rndCount INT,
  rndAddFreq INT,
  rndText TEXT
);

-- Test Parameters (and a random text block)
INSERT INTO p(rndCount, rndAddFreq, rndText) VALUES
(100000,
 18,

'Sed ut perspiciatis, unde omnis iste natus error sit voluptatemaccusantium doloremque '|| 'laudantium, totam rem aperiam eaque ipsa, quae ab illo inventoreveritatis et quasi architecto '|| 'beatae vitae dicta sunt, explicabo. Nemo enim ipsam voluptatem, quiavoluptas sit, aspernatur '|| 'aut odit aut fugit, sed quia consequuntur magni dolores eos, quiratione voluptatem sequi '|| 'nesciunt, neque porro quisquam est, qui dolorem ipsum, quia dolor sitamet consectetur '|| 'adipiscing velit, sed quia non numquam [do] eius modi temporainci[di]dunt, ut labore et dolore '||

 'magnam aliquam quaerat voluptatem.'
);

-- Randomized values table & population
CREATE TABLE rnd(ri INTEGER PRIMARY KEY, rr REAL, rl INT);

WITH R(i, rndInt, rndReal, rndLen) AS (
    SELECT 0,0,0.1,0
    UNION ALL

SELECT R.i+1, abs(random() % (10*p.rndCount)), abs(random() %(1000*p.rndCount))/1000.0, abs(random() % 64) + 1

      FROM R,p
     WHERE R.i < (p.rndCount * 2)
)
INSERT OR IGNORE INTO rnd(ri, rr, rl)
SELECT rndInt, rndReal, rndLen
  FROM R
 WHERE R.i > 0;

-- Test Tables with 151 columns of different Types
CREATE TABLE t1 (id INT PRIMARY KEY,
  i00 INT,  r00 REAL,  n00 NUMERIC,  b00 BLOB,  v00 TEXT,
  i01 INT,  r01 REAL,  n01 NUMERIC,  b01 BLOB,  v01 TEXT,
  i02 INT,  r02 REAL,  n02 NUMERIC,  b02 BLOB,  v02 TEXT,
  i03 INT,  r03 REAL,  n03 NUMERIC,  b03 BLOB,  v03 TEXT,
  i04 INT,  r04 REAL,  n04 NUMERIC,  b04 BLOB,  v04 TEXT,
  i05 INT,  r05 REAL,  n05 NUMERIC,  b05 BLOB,  v05 TEXT,
  i06 INT,  r06 REAL,  n06 NUMERIC,  b06 BLOB,  v06 TEXT,
  i07 INT,  r07 REAL,  n07 NUMERIC,  b07 BLOB,  v07 TEXT,
  i08 INT,  r08 REAL,  n08 NUMERIC,  b08 BLOB,  v08 TEXT,
  i09 INT,  r09 REAL,  n09 NUMERIC,  b09 BLOB,  v09 TEXT,
  i10 INT,  r10 REAL,  n10 NUMERIC,  b10 BLOB,  v10 TEXT,
  i11 INT,  r11 REAL,  n11 NUMERIC,  b11 BLOB,  v11 TEXT,
  i12 INT,  r12 REAL,  n12 NUMERIC,  b12 BLOB,  v12 TEXT,
  i13 INT,  r13 REAL,  n13 NUMERIC,  b13 BLOB,  v13 TEXT,
  i14 INT,  r14 REAL,  n14 NUMERIC,  b14 BLOB,  v14 TEXT,
  i15 INT,  r15 REAL,  n15 NUMERIC,  b15 BLOB,  v15 TEXT,
  i16 INT,  r16 REAL,  n16 NUMERIC,  b16 BLOB,  v16 TEXT,
  i17 INT,  r17 REAL,  n17 NUMERIC,  b17 BLOB,  v17 TEXT,
  i18 INT,  r18 REAL,  n18 NUMERIC,  b18 BLOB,  v18 TEXT,
  i19 INT,  r19 REAL,  n19 NUMERIC,  b19 BLOB,  v19 TEXT,
  i20 INT,  r20 REAL,  n20 NUMERIC,  b20 BLOB,  v20 TEXT COLLATE NOCASE,
  i21 INT,  r21 REAL,  n21 NUMERIC,  b21 BLOB,  v21 TEXT COLLATE NOCASE,
  i22 INT,  r22 REAL,  n22 NUMERIC,  b22 BLOB,  v22 TEXT COLLATE NOCASE,
  i23 INT,  r23 REAL,  n23 NUMERIC,  b23 BLOB,  v23 TEXT COLLATE NOCASE,
  i24 INT,  r24 REAL,  n24 NUMERIC,  b24 BLOB,  v24 TEXT COLLATE NOCASE,
  i25 INT,  r25 REAL,  n25 NUMERIC,  b25 BLOB,  v25 TEXT COLLATE BINARY,
  i26 INT,  r26 REAL,  n26 NUMERIC,  b26 BLOB,  v26 TEXT COLLATE BINARY,
  i27 INT,  r27 REAL,  n27 NUMERIC,  b27 BLOB,  v27 TEXT COLLATE BINARY,
  i28 INT,  r28 REAL,  n28 NUMERIC,  b28 BLOB,  v28 TEXT COLLATE BINARY,
  i29 INT,  r29 REAL,  n29 NUMERIC,  b29 BLOB,  v29 TEXT COLLATE BINARY
) WITHOUT ROWID;

CREATE TABLE t2 (id INT PRIMARY KEY,
  i00 INT,  r00 REAL,  n00 NUMERIC,  b00 BLOB,  v00 TEXT,
  i01 INT,  r01 REAL,  n01 NUMERIC,  b01 BLOB,  v01 TEXT,
  i02 INT,  r02 REAL,  n02 NUMERIC,  b02 BLOB,  v02 TEXT,
  i03 INT,  r03 REAL,  n03 NUMERIC,  b03 BLOB,  v03 TEXT,
  i04 INT,  r04 REAL,  n04 NUMERIC,  b04 BLOB,  v04 TEXT,
  i05 INT,  r05 REAL,  n05 NUMERIC,  b05 BLOB,  v05 TEXT,
  i06 INT,  r06 REAL,  n06 NUMERIC,  b06 BLOB,  v06 TEXT,
  i07 INT,  r07 REAL,  n07 NUMERIC,  b07 BLOB,  v07 TEXT,
  i08 INT,  r08 REAL,  n08 NUMERIC,  b08 BLOB,  v08 TEXT,
  i09 INT,  r09 REAL,  n09 NUMERIC,  b09 BLOB,  v09 TEXT,
  i10 INT,  r10 REAL,  n10 NUMERIC,  b10 BLOB,  v10 TEXT,
  i11 INT,  r11 REAL,  n11 NUMERIC,  b11 BLOB,  v11 TEXT,
  i12 INT,  r12 REAL,  n12 NUMERIC,  b12 BLOB,  v12 TEXT,
  i13 INT,  r13 REAL,  n13 NUMERIC,  b13 BLOB,  v13 TEXT,
  i14 INT,  r14 REAL,  n14 NUMERIC,  b14 BLOB,  v14 TEXT,
  i15 INT,  r15 REAL,  n15 NUMERIC,  b15 BLOB,  v15 TEXT,
  i16 INT,  r16 REAL,  n16 NUMERIC,  b16 BLOB,  v16 TEXT,
  i17 INT,  r17 REAL,  n17 NUMERIC,  b17 BLOB,  v17 TEXT,
  i18 INT,  r18 REAL,  n18 NUMERIC,  b18 BLOB,  v18 TEXT,
  i19 INT,  r19 REAL,  n19 NUMERIC,  b19 BLOB,  v19 TEXT,
  i20 INT,  r20 REAL,  n20 NUMERIC,  b20 BLOB,  v20 TEXT COLLATE NOCASE,
  i21 INT,  r21 REAL,  n21 NUMERIC,  b21 BLOB,  v21 TEXT COLLATE NOCASE,
  i22 INT,  r22 REAL,  n22 NUMERIC,  b22 BLOB,  v22 TEXT COLLATE NOCASE,
  i23 INT,  r23 REAL,  n23 NUMERIC,  b23 BLOB,  v23 TEXT COLLATE NOCASE,
  i24 INT,  r24 REAL,  n24 NUMERIC,  b24 BLOB,  v24 TEXT COLLATE NOCASE,
  i25 INT,  r25 REAL,  n25 NUMERIC,  b25 BLOB,  v25 TEXT COLLATE BINARY,
  i26 INT,  r26 REAL,  n26 NUMERIC,  b26 BLOB,  v26 TEXT COLLATE BINARY,
  i27 INT,  r27 REAL,  n27 NUMERIC,  b27 BLOB,  v27 TEXT COLLATE BINARY,
  i28 INT,  r28 REAL,  n28 NUMERIC,  b28 BLOB,  v28 TEXT COLLATE BINARY,
  i29 INT,  r29 REAL,  n29 NUMERIC,  b29 BLOB,  v29 TEXT COLLATE BINARY
) WITHOUT ROWID;

-- Populate the Test tables with Randomized data
WITH R(vi, vr, vt, vn, vb) AS (
    SELECT rnd.ri,
       rnd.rr,
       substr(p.rndText,1,rnd.rl),
       date('1901-01-01','+'||CAST((rnd.ri % 5000) AS TEXT)||' days'),
       randomblob(rnd.rl)
      FROM rnd,p
)
INSERT OR IGNORE INTO t1(id,
  i00, r00, n00, b00, v00,
  i01, r01, n01, b01, v01,
  i02, r02, n02, b02, v02,
  i03, r03, n03, b03, v03,
  i04, r04, n04, b04, v04,
  i05, r05, n05, b05, v05,
  i06, r06, n06, b06, v06,
  i07, r07, n07, b07, v07,
  i08, r08, n08, b08, v08,
  i09, r09, n09, b09, v09,
  i10, r10, n10, b10, v10,
  i11, r11, n11, b11, v11,
  i12, r12, n12, b12, v12,
  i13, r13, n13, b13, v13,
  i14, r14, n14, b14, v14,
  i15, r15, n15, b15, v15,
  i16, r16, n16, b16, v16,
  i17, r17, n17, b17, v17,
  i18, r18, n18, b18, v18,
  i19, r19, n19, b19, v19,
  i20, r20, n20, b20, v20,
  i21, r21, n21, b21, v21,
  i22, r22, n22, b22, v22,
  i23, r23, n23, b23, v23,
  i24, r24, n24, b24, v24,
  i25, r25, n25, b25, v25,
  i26, r26, n26, b26, v26,
  i27, r27, n27, b27, v27,
  i28, r28, n28, b28, v28,
  i29, r29, n29, b29, v29
) SELECT vi,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt
  FROM R,p
 WHERE ((vi / 100) % p.rndAddFreq) = 1
;

WITH R(vi, vr, vt, vn, vb) AS (
    SELECT rnd.ri,
       rnd.rr,
       substr(p.rndText,1,rnd.rl),
       date('1901-01-01','+'||CAST((rnd.ri % 5000) AS TEXT)||' days'),
       randomblob(rnd.rl)
      FROM rnd,p
)
INSERT OR IGNORE INTO t2(id,
  i00, r00, n00, b00, v00,
  i01, r01, n01, b01, v01,
  i02, r02, n02, b02, v02,
  i03, r03, n03, b03, v03,
  i04, r04, n04, b04, v04,
  i05, r05, n05, b05, v05,
  i06, r06, n06, b06, v06,
  i07, r07, n07, b07, v07,
  i08, r08, n08, b08, v08,
  i09, r09, n09, b09, v09,
  i10, r10, n10, b10, v10,
  i11, r11, n11, b11, v11,
  i12, r12, n12, b12, v12,
  i13, r13, n13, b13, v13,
  i14, r14, n14, b14, v14,
  i15, r15, n15, b15, v15,
  i16, r16, n16, b16, v16,
  i17, r17, n17, b17, v17,
  i18, r18, n18, b18, v18,
  i19, r19, n19, b19, v19,
  i20, r20, n20, b20, v20,
  i21, r21, n21, b21, v21,
  i22, r22, n22, b22, v22,
  i23, r23, n23, b23, v23,
  i24, r24, n24, b24, v24,
  i25, r25, n25, b25, v25,
  i26, r26, n26, b26, v26,
  i27, r27, n27, b27, v27,
  i28, r28, n28, b28, v28,
  i29, r29, n29, b29, v29
) SELECT vi,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt
  FROM R,p
 WHERE ((vi / 10) % p.rndAddFreq) = 1
;

-- Simply make the chosen parameters visible
SELECT * FROM p;


-- Test JOIN method  -- This is the baddie.
SELECT t1.i00, t1.r00, t1.n00, t1.v00
  FROM t1

JOIN t2 ON t2.i00 = t1.i00 AND t2.r00 = t1.r00 AND t2.n00 = t1.n00AND t2.v00 = t1.v00

;

-- Test INTERSECT method
SELECT t1.i00, t1.r00, t1.n00, t1.v00
  FROM t1
INTERSECT
SELECT t2.i00, t2.r00, t2.n00, t2.v00
  FROM t2
;

-- Test WHERE IN (...) method
SELECT t1.i00, t1.r00, t1.n00, t1.v00
  FROM t1
WHERE t1.i00 IN (SELECT t2.i00 FROM t2)
;

-- Test JOIN with ORDER BY
SELECT t1.i00, t1.r00, t1.n00, t1.v00
  FROM t1

JOIN t2 ON t2.i00 = t1.i00 AND t2.r00 = t1.r00 AND t2.n00 = t1.n00AND t2.v00 = t1.v00

 ORDER BY t1.i00, t1.r00, t1.n00, t1.v00
;

-- Test INTERSECT with ORDER BY
SELECT t1.i00, t1.r00, t1.n00, t1.v00
  FROM t1
INTERSECT
SELECT t2.i00, t2.r00, t2.n00, t2.v00
  FROM t2
 ORDER BY t1.i00, t1.r00, t1.n00, t1.v00
;

-- Test WHERE IN (...) with ORDER BY
SELECT t1.i00, t1.r00, t1.n00, t1.v00
  FROM t1
WHERE t1.i00 IN (SELECT t2.i00 FROM t2)
 ORDER BY t1.i00, t1.r00, t1.n00, t1.v00
;

-- Display simply the COUNTs from our random and test tables and thematching

-- values, to ensure the test is valid.
WITH RowCounts(Entity, RowCnt) AS (
    SELECT 'Randomizer', COUNT(*) FROM rnd
    UNION ALL
    SELECT 'Test Table t1', COUNT(*) FROM t1
    UNION ALL
    SELECT 'Test Table t2', COUNT(*) FROM t2
    UNION ALL

SELECT 't1 : t2 Matches', COUNT(*) FROM t2 WHERE t2.i00 IN (SELECTt1.i00 FROM t1)

)
SELECT * FROM RowCounts
;

-- CleanUp
DROP TABLE rnd;
DROP TABLE t1;
DROP TABLE t2;
DROP TABLE p;



_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

[sqlite] JOIN vs. INTERSECT vs. WHERE IN (...) - speed and efficiency differences

Reply via email to