Full tests completed with findings ranging from less interesting to
exposing a rather significant inefficiency in SQLite.
I won't post all the tests because that would take far too much space,
in stead I will simply discuss the experiment and findings and post the
test script so that anyone can verify the findings.
*Problem*:
Earlier tests suggested that, in a query where the common rows between
two tables need to be found, JOIN is faster than INTERSECT when there
are very few matches with efficiencies evening out as the number of
matches increase. It suggested, more pertinently, that an ORDER BY
clause when added to both types of queries improved all-round
performance by a significant margin (moreso in JOIN's case).
After examination of the tests, comments suggested that a lot of the
efficiency differences may be due to:
- row_id involvement,
- checks only done on Primary key values as opposed to common row
values, and
- test tables having one or two columns only not matching any real-world
scenario which made the findings irrelevant.
It was also pointed out that apart from JOIN and INTERSECT, a WHERE IN
(SELECT FROM ...) clause could achieve the same query result set and
should form part of the tests.
*Purpose*:
This experiment and tests aimed to establish the differences in
efficiency for queries with the same results that could be posed in
different ways to the Query Planner in specifically SQLite.
The possible ways to ask the same question that was tested included
similar queries (producing the same result set) when finding the common
set of rows between two tables using JOIN, INTERSECT and WHERE IN (...),
and secondly similar queries finding the exclusive set of rows between
two tables using OUTER JOIN, EXCEPT and WHERE NOT IN (...).
[SQLite doesn't implement a full outer join so that specific test was
restricted to comparing EXCEPT vs. WHERE NOT IN (...)]
*Method*:
A simple SQL script that first establishes a table "p" with parameters
that dictate the data shape added to the tests. Next a random value
table "rnd" is set up with the number of rows and randomness scope
dictated by table "p" and then two test tables "t1" and "t2" with each
151 columns (1 Key and 30 each of 5 data types - INT, REAL, NUMERIC
(Dates), BLOB and TEXT) which gets populated with randomly selected rows
from the "rnd" table so that many rows appear only in either table, and
some rows appear in both (coincidence).
The total number of rows to choose from is dictated by the "p" table's
"rndCount" field and the approximate percentage of rows that appear in
both tables is dictated by the "rndAddFreq" (random row adding
frequency) field.
Up to here the tests are all similar, except that for the EXCEPT / JOIN
WHERE NOT IN (...) a much higher frequency of coincidence between the
table rows were selected since the output count is opposite to the other
tests.
The queries designed to produce the exact same output using the
different methods discussed above are run in turn. The queries are then
repeated with an added ORDER BY clause.
Timings are measured for all.
*Test parameters* to ensure all potential problems mentioned above were
accounted for:
- Pre-setup of the random table ensures all the queries operate on the
exact same randomized data.
- Tables contain 150 data columns of all typical use types and with
varying data and data lengths (in the case of BLOB and TEXT at least)
- Tables have a Primary Key, but it does not alias the Row_ID
- Tests were repeated for both normal and WITHOUT ROWID tables
- Query field comparisons involved a full set of field types (except
BLOB columns) and not the primary key
- Test-bed using older/slower notebook and sqlitespeed on Windows 10
64bit prof with updated latest sqlite3 DLL file - version 3.20.1 Release
2017-08-24.
*General Notes*:
- Queries were tested in different orders to avoid caching effects, but
this made no difference - it seems there is no mystery caching when the
queries are so different.
- The EXCEPT vs. WHERE NOT IN (...) queries yielded consistently similar
times through a range of parameters, with or without ORDER BY clauses,
with or without ROWIDs, so their tests provided no real insight, is
probably both well-optimized already, and I won't refer to them again in
this document.
*Findings* pertaining to JOIN, INTERSECT and WHERE IN (...) type queries
on tables *with Row_ids*:
First important thing to note is that when not involving the primary key
and using these larger (wider) tables, the ORDER BY clause addition made
very little difference (sometimes completely zero difference). JOIN and
INTERSECT posted similar times for these larger (wider) tables with
INTERSECT consistently winning by a very slight margin.
The surprise here was that WHERE IN (...) queries were consistently
faster than both JOIN and INTERSECT (again with the added ORDER BY being
inconsequential). Not by a large margin though, 4-5% on average.
*RESULT*: No clear opportunity for optimization exists for the above case.
*Findings* pertaining to JOIN, INTERSECT and WHERE IN (...) type queries
on tables *WITHOUT Row_ids*:
(This is the full test posted below because it is the one that matters most)
INTERSECT AND WHERE IN (...) queries posted similar times here with
WHERE IN (...) being the slightly faster (similar to the above
findings), but BOTH were a good 10 - 15% faster than on the ROW-ID
tables, so WITHOUT ROWID tables seem to have a definite advantage here
(it is currently unclear to me why this is).
A troubling test is the JOIN on WITHOUT ROWID tables - it took several
orders of magnitude longer than any other test in the entire experiment.
Typical test times posted in the order of between 10 to 200
milliseconds. The JOIN query here took over a minute on this test bed -
that /must/ be a glaring mistake in the Query planner or match method
somewhere.
*RESULT*: No clear opportunity for optimization exists for INTERSECT and
WHERE IN (...) queries above, but the JOIN query has perhaps a QP
deficiency that needs fixing or a Query execution optimization that is
possible.
*Test Script*:
Note that you can increase / decrease the number of random generated
rows by adjusting the p.rndCount, and adjust the frequency of coinciding
rows (2..18 where 2 is highest and 18 lowest coincidence) with the
p.rndAddFreq values inserted into the p table. I found 100,000 rows and
18 frequency produces test tables with roughly 1000 coinciding hits for
the queries as a good measure.
You will want to place it inside a transaction maybe, turn timers on and
pipe the output to a file, it can get quite lengthy.
This posted version of the test script is precisely the one that
highlights the JOIN inefficiency mentioned above. On the test bed used,
it posted times in the order of 50-60 milliseconds for the INTERCEPT and
WHERE IN (...) queries, and /over a minute/ for the JOIN query. If you
simply remove the "WITHOUT ROWID" specifiers form the two test tables,
this entire script will run in under 3 seconds typically. Add the
"WITHOUT ROWID" (as below) and the script takes minutes. (I also tested
with older 3.17.0 DLL, it was even slower, but not significantly so).
-- Test Paremeters Table
CREATE TABLE p (
rndCount INT,
rndAddFreq INT,
rndText TEXT
);
-- Test Parameters (and a random text block)
INSERT INTO p(rndCount, rndAddFreq, rndText) VALUES
(100000,
18,
'Sed ut perspiciatis, unde omnis iste natus error sit voluptatem
accusantium doloremque '||
'laudantium, totam rem aperiam eaque ipsa, quae ab illo inventore
veritatis et quasi architecto '||
'beatae vitae dicta sunt, explicabo. Nemo enim ipsam voluptatem, quia
voluptas sit, aspernatur '||
'aut odit aut fugit, sed quia consequuntur magni dolores eos, qui
ratione voluptatem sequi '||
'nesciunt, neque porro quisquam est, qui dolorem ipsum, quia dolor sit
amet consectetur '||
'adipiscing velit, sed quia non numquam [do] eius modi tempora
inci[di]dunt, ut labore et dolore '||
'magnam aliquam quaerat voluptatem.'
);
-- Randomized values table & population
CREATE TABLE rnd(ri INTEGER PRIMARY KEY, rr REAL, rl INT);
WITH R(i, rndInt, rndReal, rndLen) AS (
SELECT 0,0,0.1,0
UNION ALL
SELECT R.i+1, abs(random() % (10*p.rndCount)), abs(random() %
(1000*p.rndCount))/1000.0, abs(random() % 64) + 1
FROM R,p
WHERE R.i < (p.rndCount * 2)
)
INSERT OR IGNORE INTO rnd(ri, rr, rl)
SELECT rndInt, rndReal, rndLen
FROM R
WHERE R.i > 0;
-- Test Tables with 151 columns of different Types
CREATE TABLE t1 (id INT PRIMARY KEY,
i00 INT, r00 REAL, n00 NUMERIC, b00 BLOB, v00 TEXT,
i01 INT, r01 REAL, n01 NUMERIC, b01 BLOB, v01 TEXT,
i02 INT, r02 REAL, n02 NUMERIC, b02 BLOB, v02 TEXT,
i03 INT, r03 REAL, n03 NUMERIC, b03 BLOB, v03 TEXT,
i04 INT, r04 REAL, n04 NUMERIC, b04 BLOB, v04 TEXT,
i05 INT, r05 REAL, n05 NUMERIC, b05 BLOB, v05 TEXT,
i06 INT, r06 REAL, n06 NUMERIC, b06 BLOB, v06 TEXT,
i07 INT, r07 REAL, n07 NUMERIC, b07 BLOB, v07 TEXT,
i08 INT, r08 REAL, n08 NUMERIC, b08 BLOB, v08 TEXT,
i09 INT, r09 REAL, n09 NUMERIC, b09 BLOB, v09 TEXT,
i10 INT, r10 REAL, n10 NUMERIC, b10 BLOB, v10 TEXT,
i11 INT, r11 REAL, n11 NUMERIC, b11 BLOB, v11 TEXT,
i12 INT, r12 REAL, n12 NUMERIC, b12 BLOB, v12 TEXT,
i13 INT, r13 REAL, n13 NUMERIC, b13 BLOB, v13 TEXT,
i14 INT, r14 REAL, n14 NUMERIC, b14 BLOB, v14 TEXT,
i15 INT, r15 REAL, n15 NUMERIC, b15 BLOB, v15 TEXT,
i16 INT, r16 REAL, n16 NUMERIC, b16 BLOB, v16 TEXT,
i17 INT, r17 REAL, n17 NUMERIC, b17 BLOB, v17 TEXT,
i18 INT, r18 REAL, n18 NUMERIC, b18 BLOB, v18 TEXT,
i19 INT, r19 REAL, n19 NUMERIC, b19 BLOB, v19 TEXT,
i20 INT, r20 REAL, n20 NUMERIC, b20 BLOB, v20 TEXT COLLATE NOCASE,
i21 INT, r21 REAL, n21 NUMERIC, b21 BLOB, v21 TEXT COLLATE NOCASE,
i22 INT, r22 REAL, n22 NUMERIC, b22 BLOB, v22 TEXT COLLATE NOCASE,
i23 INT, r23 REAL, n23 NUMERIC, b23 BLOB, v23 TEXT COLLATE NOCASE,
i24 INT, r24 REAL, n24 NUMERIC, b24 BLOB, v24 TEXT COLLATE NOCASE,
i25 INT, r25 REAL, n25 NUMERIC, b25 BLOB, v25 TEXT COLLATE BINARY,
i26 INT, r26 REAL, n26 NUMERIC, b26 BLOB, v26 TEXT COLLATE BINARY,
i27 INT, r27 REAL, n27 NUMERIC, b27 BLOB, v27 TEXT COLLATE BINARY,
i28 INT, r28 REAL, n28 NUMERIC, b28 BLOB, v28 TEXT COLLATE BINARY,
i29 INT, r29 REAL, n29 NUMERIC, b29 BLOB, v29 TEXT COLLATE BINARY
) WITHOUT ROWID;
CREATE TABLE t2 (id INT PRIMARY KEY,
i00 INT, r00 REAL, n00 NUMERIC, b00 BLOB, v00 TEXT,
i01 INT, r01 REAL, n01 NUMERIC, b01 BLOB, v01 TEXT,
i02 INT, r02 REAL, n02 NUMERIC, b02 BLOB, v02 TEXT,
i03 INT, r03 REAL, n03 NUMERIC, b03 BLOB, v03 TEXT,
i04 INT, r04 REAL, n04 NUMERIC, b04 BLOB, v04 TEXT,
i05 INT, r05 REAL, n05 NUMERIC, b05 BLOB, v05 TEXT,
i06 INT, r06 REAL, n06 NUMERIC, b06 BLOB, v06 TEXT,
i07 INT, r07 REAL, n07 NUMERIC, b07 BLOB, v07 TEXT,
i08 INT, r08 REAL, n08 NUMERIC, b08 BLOB, v08 TEXT,
i09 INT, r09 REAL, n09 NUMERIC, b09 BLOB, v09 TEXT,
i10 INT, r10 REAL, n10 NUMERIC, b10 BLOB, v10 TEXT,
i11 INT, r11 REAL, n11 NUMERIC, b11 BLOB, v11 TEXT,
i12 INT, r12 REAL, n12 NUMERIC, b12 BLOB, v12 TEXT,
i13 INT, r13 REAL, n13 NUMERIC, b13 BLOB, v13 TEXT,
i14 INT, r14 REAL, n14 NUMERIC, b14 BLOB, v14 TEXT,
i15 INT, r15 REAL, n15 NUMERIC, b15 BLOB, v15 TEXT,
i16 INT, r16 REAL, n16 NUMERIC, b16 BLOB, v16 TEXT,
i17 INT, r17 REAL, n17 NUMERIC, b17 BLOB, v17 TEXT,
i18 INT, r18 REAL, n18 NUMERIC, b18 BLOB, v18 TEXT,
i19 INT, r19 REAL, n19 NUMERIC, b19 BLOB, v19 TEXT,
i20 INT, r20 REAL, n20 NUMERIC, b20 BLOB, v20 TEXT COLLATE NOCASE,
i21 INT, r21 REAL, n21 NUMERIC, b21 BLOB, v21 TEXT COLLATE NOCASE,
i22 INT, r22 REAL, n22 NUMERIC, b22 BLOB, v22 TEXT COLLATE NOCASE,
i23 INT, r23 REAL, n23 NUMERIC, b23 BLOB, v23 TEXT COLLATE NOCASE,
i24 INT, r24 REAL, n24 NUMERIC, b24 BLOB, v24 TEXT COLLATE NOCASE,
i25 INT, r25 REAL, n25 NUMERIC, b25 BLOB, v25 TEXT COLLATE BINARY,
i26 INT, r26 REAL, n26 NUMERIC, b26 BLOB, v26 TEXT COLLATE BINARY,
i27 INT, r27 REAL, n27 NUMERIC, b27 BLOB, v27 TEXT COLLATE BINARY,
i28 INT, r28 REAL, n28 NUMERIC, b28 BLOB, v28 TEXT COLLATE BINARY,
i29 INT, r29 REAL, n29 NUMERIC, b29 BLOB, v29 TEXT COLLATE BINARY
) WITHOUT ROWID;
-- Populate the Test tables with Randomized data
WITH R(vi, vr, vt, vn, vb) AS (
SELECT rnd.ri,
rnd.rr,
substr(p.rndText,1,rnd.rl),
date('1901-01-01','+'||CAST((rnd.ri % 5000) AS TEXT)||' days'),
randomblob(rnd.rl)
FROM rnd,p
)
INSERT OR IGNORE INTO t1(id,
i00, r00, n00, b00, v00,
i01, r01, n01, b01, v01,
i02, r02, n02, b02, v02,
i03, r03, n03, b03, v03,
i04, r04, n04, b04, v04,
i05, r05, n05, b05, v05,
i06, r06, n06, b06, v06,
i07, r07, n07, b07, v07,
i08, r08, n08, b08, v08,
i09, r09, n09, b09, v09,
i10, r10, n10, b10, v10,
i11, r11, n11, b11, v11,
i12, r12, n12, b12, v12,
i13, r13, n13, b13, v13,
i14, r14, n14, b14, v14,
i15, r15, n15, b15, v15,
i16, r16, n16, b16, v16,
i17, r17, n17, b17, v17,
i18, r18, n18, b18, v18,
i19, r19, n19, b19, v19,
i20, r20, n20, b20, v20,
i21, r21, n21, b21, v21,
i22, r22, n22, b22, v22,
i23, r23, n23, b23, v23,
i24, r24, n24, b24, v24,
i25, r25, n25, b25, v25,
i26, r26, n26, b26, v26,
i27, r27, n27, b27, v27,
i28, r28, n28, b28, v28,
i29, r29, n29, b29, v29
) SELECT vi,
vi, vr, vn, vb, vt,
vi, vr, vn, vb, vt,
vi, vr, vn, vb, vt,
vi, vr, vn, vb, vt,
vi, vr, vn, vb, vt,
vi, vr, vn, vb, vt,
vi, vr, vn, vb, vt,
vi, vr, vn, vb, vt,
vi, vr, vn, vb, vt,
vi, vr, vn, vb, vt,
vi, vr, vn, vb, vt,
vi, vr, vn, vb, vt,
vi, vr, vn, vb, vt,
vi, vr, vn, vb, vt,
vi, vr, vn, vb, vt,
vi, vr, vn, vb, vt,
vi, vr, vn, vb, vt,
vi, vr, vn, vb, vt,
vi, vr, vn, vb, vt,
vi, vr, vn, vb, vt,
vi, vr, vn, vb, vt,
vi, vr, vn, vb, vt,
vi, vr, vn, vb, vt,
vi, vr, vn, vb, vt,
vi, vr, vn, vb, vt,
vi, vr, vn, vb, vt,
vi, vr, vn, vb, vt,
vi, vr, vn, vb, vt,
vi, vr, vn, vb, vt,
vi, vr, vn, vb, vt
FROM R,p
WHERE ((vi / 100) % p.rndAddFreq) = 1
;
WITH R(vi, vr, vt, vn, vb) AS (
SELECT rnd.ri,
rnd.rr,
substr(p.rndText,1,rnd.rl),
date('1901-01-01','+'||CAST((rnd.ri % 5000) AS TEXT)||' days'),
randomblob(rnd.rl)
FROM rnd,p
)
INSERT OR IGNORE INTO t2(id,
i00, r00, n00, b00, v00,
i01, r01, n01, b01, v01,
i02, r02, n02, b02, v02,
i03, r03, n03, b03, v03,
i04, r04, n04, b04, v04,
i05, r05, n05, b05, v05,
i06, r06, n06, b06, v06,
i07, r07, n07, b07, v07,
i08, r08, n08, b08, v08,
i09, r09, n09, b09, v09,
i10, r10, n10, b10, v10,
i11, r11, n11, b11, v11,
i12, r12, n12, b12, v12,
i13, r13, n13, b13, v13,
i14, r14, n14, b14, v14,
i15, r15, n15, b15, v15,
i16, r16, n16, b16, v16,
i17, r17, n17, b17, v17,
i18, r18, n18, b18, v18,
i19, r19, n19, b19, v19,
i20, r20, n20, b20, v20,
i21, r21, n21, b21, v21,
i22, r22, n22, b22, v22,
i23, r23, n23, b23, v23,
i24, r24, n24, b24, v24,
i25, r25, n25, b25, v25,
i26, r26, n26, b26, v26,
i27, r27, n27, b27, v27,
i28, r28, n28, b28, v28,
i29, r29, n29, b29, v29
) SELECT vi,
vi, vr, vn, vb, vt,
vi, vr, vn, vb, vt,
vi, vr, vn, vb, vt,
vi, vr, vn, vb, vt,
vi, vr, vn, vb, vt,
vi, vr, vn, vb, vt,
vi, vr, vn, vb, vt,
vi, vr, vn, vb, vt,
vi, vr, vn, vb, vt,
vi, vr, vn, vb, vt,
vi, vr, vn, vb, vt,
vi, vr, vn, vb, vt,
vi, vr, vn, vb, vt,
vi, vr, vn, vb, vt,
vi, vr, vn, vb, vt,
vi, vr, vn, vb, vt,
vi, vr, vn, vb, vt,
vi, vr, vn, vb, vt,
vi, vr, vn, vb, vt,
vi, vr, vn, vb, vt,
vi, vr, vn, vb, vt,
vi, vr, vn, vb, vt,
vi, vr, vn, vb, vt,
vi, vr, vn, vb, vt,
vi, vr, vn, vb, vt,
vi, vr, vn, vb, vt,
vi, vr, vn, vb, vt,
vi, vr, vn, vb, vt,
vi, vr, vn, vb, vt,
vi, vr, vn, vb, vt
FROM R,p
WHERE ((vi / 10) % p.rndAddFreq) = 1
;
-- Simply make the chosen parameters visible
SELECT * FROM p;
-- Test JOIN method -- This is the baddie.
SELECT t1.i00, t1.r00, t1.n00, t1.v00
FROM t1
JOIN t2 ON t2.i00 = t1.i00 AND t2.r00 = t1.r00 AND t2.n00 = t1.n00
AND t2.v00 = t1.v00
;
-- Test INTERSECT method
SELECT t1.i00, t1.r00, t1.n00, t1.v00
FROM t1
INTERSECT
SELECT t2.i00, t2.r00, t2.n00, t2.v00
FROM t2
;
-- Test WHERE IN (...) method
SELECT t1.i00, t1.r00, t1.n00, t1.v00
FROM t1
WHERE t1.i00 IN (SELECT t2.i00 FROM t2)
;
-- Test JOIN with ORDER BY
SELECT t1.i00, t1.r00, t1.n00, t1.v00
FROM t1
JOIN t2 ON t2.i00 = t1.i00 AND t2.r00 = t1.r00 AND t2.n00 = t1.n00
AND t2.v00 = t1.v00
ORDER BY t1.i00, t1.r00, t1.n00, t1.v00
;
-- Test INTERSECT with ORDER BY
SELECT t1.i00, t1.r00, t1.n00, t1.v00
FROM t1
INTERSECT
SELECT t2.i00, t2.r00, t2.n00, t2.v00
FROM t2
ORDER BY t1.i00, t1.r00, t1.n00, t1.v00
;
-- Test WHERE IN (...) with ORDER BY
SELECT t1.i00, t1.r00, t1.n00, t1.v00
FROM t1
WHERE t1.i00 IN (SELECT t2.i00 FROM t2)
ORDER BY t1.i00, t1.r00, t1.n00, t1.v00
;
-- Display simply the COUNTs from our random and test tables and the
matching
-- values, to ensure the test is valid.
WITH RowCounts(Entity, RowCnt) AS (
SELECT 'Randomizer', COUNT(*) FROM rnd
UNION ALL
SELECT 'Test Table t1', COUNT(*) FROM t1
UNION ALL
SELECT 'Test Table t2', COUNT(*) FROM t2
UNION ALL
SELECT 't1 : t2 Matches', COUNT(*) FROM t2 WHERE t2.i00 IN (SELECT
t1.i00 FROM t1)
)
SELECT * FROM RowCounts
;
-- CleanUp
DROP TABLE rnd;
DROP TABLE t1;
DROP TABLE t2;
DROP TABLE p;
_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users