Full tests completed with findings ranging from less interesting to exposing a rather significant inefficiency in SQLite.

I won't post all the tests because that would take far too much space, in stead I will simply discuss the experiment and findings and post the test script so that anyone can verify the findings.


*Problem*:
Earlier tests suggested that, in a query where the common rows between two tables need to be found, JOIN is faster than INTERSECT when there are very few matches with efficiencies evening out as the number of matches increase. It suggested, more pertinently, that an ORDER BY clause when added to both types of queries improved all-round performance by a significant margin (moreso in JOIN's case). After examination of the tests, comments suggested that a lot of the efficiency differences may be due to:
- row_id involvement,
- checks only done on Primary key values as opposed to common row values, and - test tables having one or two columns only not matching any real-world scenario which made the findings irrelevant. It was also pointed out that apart from JOIN and INTERSECT, a WHERE IN (SELECT FROM ...) clause could achieve the same query result set and should form part of the tests.


*Purpose*:
This experiment and tests aimed to establish the differences in efficiency for queries with the same results that could be posed in different ways to the Query Planner in specifically SQLite. The possible ways to ask the same question that was tested included similar queries (producing the same result set) when finding the common set of rows between two tables using JOIN, INTERSECT and WHERE IN (...), and secondly similar queries finding the exclusive set of rows between two tables using OUTER JOIN, EXCEPT and WHERE NOT IN (...). [SQLite doesn't implement a full outer join so that specific test was restricted to comparing EXCEPT vs. WHERE NOT IN (...)]


*Method*:
A simple SQL script that first establishes a table "p" with parameters that dictate the data shape added to the tests. Next a random value table "rnd" is set up with the number of rows and randomness scope dictated by table "p" and then two test tables "t1" and "t2" with each 151 columns (1 Key and 30 each of 5 data types - INT, REAL, NUMERIC (Dates), BLOB and TEXT) which gets populated with randomly selected rows from the "rnd" table so that many rows appear only in either table, and some rows appear in both (coincidence).

The total number of rows to  choose from is dictated by the "p" table's "rndCount" field and the approximate percentage of rows that appear in both tables is dictated by the "rndAddFreq" (random row adding frequency) field. Up to here the tests are all similar, except that for the EXCEPT / JOIN WHERE NOT IN (...) a much higher frequency of coincidence between the table rows were selected since the output count is opposite to the other tests. The queries designed to produce the exact same output using the different methods discussed above are run in turn. The queries are then repeated with an added ORDER BY clause.
Timings are measured for all.


*Test parameters* to ensure all potential problems mentioned above were accounted for: - Pre-setup of the random table ensures all the queries operate on the exact same randomized data. - Tables contain 150 data columns of all typical use types and with varying data and data lengths (in the case of BLOB and TEXT at least)
- Tables have a Primary Key, but it does not alias the Row_ID
- Tests were repeated for both normal and WITHOUT ROWID tables
- Query field comparisons involved a full set of field types (except BLOB columns) and not the primary key - Test-bed using older/slower notebook and sqlitespeed on Windows 10 64bit prof with updated latest sqlite3 DLL file - version 3.20.1 Release 2017-08-24.


*General Notes*:
- Queries were tested in different orders to avoid caching effects, but this made no difference - it seems there is no mystery caching when the queries are so different. - The EXCEPT vs. WHERE NOT IN (...) queries yielded consistently similar times through a range of parameters, with or without ORDER BY clauses, with or without ROWIDs, so their tests provided no real insight, is probably both well-optimized already, and I won't refer to them again in this document.


*Findings* pertaining to JOIN, INTERSECT and WHERE IN (...) type queries on tables *with Row_ids*: First important thing to note is that when not involving the primary key and using these larger (wider) tables, the ORDER BY clause addition made very little difference (sometimes completely zero difference). JOIN and INTERSECT posted similar times for these larger (wider) tables with INTERSECT consistently winning by a very slight margin. The surprise here was that WHERE IN (...) queries were consistently faster than both JOIN and INTERSECT (again with the added ORDER BY being inconsequential). Not by a large margin though, 4-5% on average.
*RESULT*: No clear opportunity for optimization exists for the above case.


*Findings* pertaining to JOIN, INTERSECT and WHERE IN (...) type queries on tables *WITHOUT Row_ids*:
(This is the full test posted below because it is the one that matters most)
INTERSECT AND WHERE IN (...) queries posted similar times here with WHERE IN (...) being the slightly faster (similar to the above findings), but BOTH were a good 10 - 15% faster than on the ROW-ID tables, so WITHOUT ROWID tables seem to have a definite advantage here (it is currently unclear to me why this is).

A troubling test is the JOIN on WITHOUT ROWID tables - it took several orders of magnitude longer than any other test in the entire experiment. Typical test times posted in the order of between 10 to 200 milliseconds. The JOIN query here took over a minute on this test bed - that /must/ be a glaring mistake in the Query planner or match method somewhere.

*RESULT*: No clear opportunity for optimization exists for INTERSECT and WHERE IN (...) queries above, but the JOIN query has perhaps a QP deficiency that needs fixing or a Query execution optimization that is possible.


*Test Script*:
Note that you can increase / decrease the number of random generated rows by adjusting the p.rndCount, and adjust the frequency of coinciding rows (2..18 where 2 is highest and 18 lowest coincidence) with the p.rndAddFreq values inserted into the p table. I found 100,000 rows and 18 frequency produces test tables with roughly 1000 coinciding hits for the queries as a good measure.

You will want to place it inside a transaction maybe, turn timers on and pipe the output to a file, it can get quite lengthy.

This posted version of the test script is precisely the one that highlights the JOIN inefficiency mentioned above. On the test bed used, it posted times in the order of 50-60 milliseconds for the INTERCEPT and WHERE IN (...) queries, and /over a minute/ for the JOIN query. If you simply remove the "WITHOUT ROWID" specifiers form the two test tables, this entire script will run in under 3 seconds typically. Add the "WITHOUT ROWID" (as below) and the script takes minutes. (I also tested with older 3.17.0 DLL, it was even slower, but not significantly so).


-- Test Paremeters Table
CREATE TABLE p (
  rndCount INT,
  rndAddFreq INT,
  rndText TEXT
);

-- Test Parameters (and a random text block)
INSERT INTO p(rndCount, rndAddFreq, rndText) VALUES
(100000,
 18,
 'Sed ut perspiciatis, unde omnis iste natus error sit voluptatem accusantium doloremque '||  'laudantium, totam rem aperiam eaque ipsa, quae ab illo inventore veritatis et quasi architecto '||  'beatae vitae dicta sunt, explicabo. Nemo enim ipsam voluptatem, quia voluptas sit, aspernatur '||  'aut odit aut fugit, sed quia consequuntur magni dolores eos, qui ratione voluptatem sequi '||  'nesciunt, neque porro quisquam est, qui dolorem ipsum, quia dolor sit amet consectetur '||  'adipiscing velit, sed quia non numquam [do] eius modi tempora inci[di]dunt, ut labore et dolore '||
 'magnam aliquam quaerat voluptatem.'
);

-- Randomized values table & population
CREATE TABLE rnd(ri INTEGER PRIMARY KEY, rr REAL, rl INT);

WITH R(i, rndInt, rndReal, rndLen) AS (
    SELECT 0,0,0.1,0
    UNION ALL
    SELECT R.i+1, abs(random() % (10*p.rndCount)), abs(random() % (1000*p.rndCount))/1000.0, abs(random() % 64) + 1
      FROM R,p
     WHERE R.i < (p.rndCount * 2)
)
INSERT OR IGNORE INTO rnd(ri, rr, rl)
SELECT rndInt, rndReal, rndLen
  FROM R
 WHERE R.i > 0;

-- Test Tables with 151 columns of different Types
CREATE TABLE t1 (id INT PRIMARY KEY,
  i00 INT,  r00 REAL,  n00 NUMERIC,  b00 BLOB,  v00 TEXT,
  i01 INT,  r01 REAL,  n01 NUMERIC,  b01 BLOB,  v01 TEXT,
  i02 INT,  r02 REAL,  n02 NUMERIC,  b02 BLOB,  v02 TEXT,
  i03 INT,  r03 REAL,  n03 NUMERIC,  b03 BLOB,  v03 TEXT,
  i04 INT,  r04 REAL,  n04 NUMERIC,  b04 BLOB,  v04 TEXT,
  i05 INT,  r05 REAL,  n05 NUMERIC,  b05 BLOB,  v05 TEXT,
  i06 INT,  r06 REAL,  n06 NUMERIC,  b06 BLOB,  v06 TEXT,
  i07 INT,  r07 REAL,  n07 NUMERIC,  b07 BLOB,  v07 TEXT,
  i08 INT,  r08 REAL,  n08 NUMERIC,  b08 BLOB,  v08 TEXT,
  i09 INT,  r09 REAL,  n09 NUMERIC,  b09 BLOB,  v09 TEXT,
  i10 INT,  r10 REAL,  n10 NUMERIC,  b10 BLOB,  v10 TEXT,
  i11 INT,  r11 REAL,  n11 NUMERIC,  b11 BLOB,  v11 TEXT,
  i12 INT,  r12 REAL,  n12 NUMERIC,  b12 BLOB,  v12 TEXT,
  i13 INT,  r13 REAL,  n13 NUMERIC,  b13 BLOB,  v13 TEXT,
  i14 INT,  r14 REAL,  n14 NUMERIC,  b14 BLOB,  v14 TEXT,
  i15 INT,  r15 REAL,  n15 NUMERIC,  b15 BLOB,  v15 TEXT,
  i16 INT,  r16 REAL,  n16 NUMERIC,  b16 BLOB,  v16 TEXT,
  i17 INT,  r17 REAL,  n17 NUMERIC,  b17 BLOB,  v17 TEXT,
  i18 INT,  r18 REAL,  n18 NUMERIC,  b18 BLOB,  v18 TEXT,
  i19 INT,  r19 REAL,  n19 NUMERIC,  b19 BLOB,  v19 TEXT,
  i20 INT,  r20 REAL,  n20 NUMERIC,  b20 BLOB,  v20 TEXT COLLATE NOCASE,
  i21 INT,  r21 REAL,  n21 NUMERIC,  b21 BLOB,  v21 TEXT COLLATE NOCASE,
  i22 INT,  r22 REAL,  n22 NUMERIC,  b22 BLOB,  v22 TEXT COLLATE NOCASE,
  i23 INT,  r23 REAL,  n23 NUMERIC,  b23 BLOB,  v23 TEXT COLLATE NOCASE,
  i24 INT,  r24 REAL,  n24 NUMERIC,  b24 BLOB,  v24 TEXT COLLATE NOCASE,
  i25 INT,  r25 REAL,  n25 NUMERIC,  b25 BLOB,  v25 TEXT COLLATE BINARY,
  i26 INT,  r26 REAL,  n26 NUMERIC,  b26 BLOB,  v26 TEXT COLLATE BINARY,
  i27 INT,  r27 REAL,  n27 NUMERIC,  b27 BLOB,  v27 TEXT COLLATE BINARY,
  i28 INT,  r28 REAL,  n28 NUMERIC,  b28 BLOB,  v28 TEXT COLLATE BINARY,
  i29 INT,  r29 REAL,  n29 NUMERIC,  b29 BLOB,  v29 TEXT COLLATE BINARY
) WITHOUT ROWID;

CREATE TABLE t2 (id INT PRIMARY KEY,
  i00 INT,  r00 REAL,  n00 NUMERIC,  b00 BLOB,  v00 TEXT,
  i01 INT,  r01 REAL,  n01 NUMERIC,  b01 BLOB,  v01 TEXT,
  i02 INT,  r02 REAL,  n02 NUMERIC,  b02 BLOB,  v02 TEXT,
  i03 INT,  r03 REAL,  n03 NUMERIC,  b03 BLOB,  v03 TEXT,
  i04 INT,  r04 REAL,  n04 NUMERIC,  b04 BLOB,  v04 TEXT,
  i05 INT,  r05 REAL,  n05 NUMERIC,  b05 BLOB,  v05 TEXT,
  i06 INT,  r06 REAL,  n06 NUMERIC,  b06 BLOB,  v06 TEXT,
  i07 INT,  r07 REAL,  n07 NUMERIC,  b07 BLOB,  v07 TEXT,
  i08 INT,  r08 REAL,  n08 NUMERIC,  b08 BLOB,  v08 TEXT,
  i09 INT,  r09 REAL,  n09 NUMERIC,  b09 BLOB,  v09 TEXT,
  i10 INT,  r10 REAL,  n10 NUMERIC,  b10 BLOB,  v10 TEXT,
  i11 INT,  r11 REAL,  n11 NUMERIC,  b11 BLOB,  v11 TEXT,
  i12 INT,  r12 REAL,  n12 NUMERIC,  b12 BLOB,  v12 TEXT,
  i13 INT,  r13 REAL,  n13 NUMERIC,  b13 BLOB,  v13 TEXT,
  i14 INT,  r14 REAL,  n14 NUMERIC,  b14 BLOB,  v14 TEXT,
  i15 INT,  r15 REAL,  n15 NUMERIC,  b15 BLOB,  v15 TEXT,
  i16 INT,  r16 REAL,  n16 NUMERIC,  b16 BLOB,  v16 TEXT,
  i17 INT,  r17 REAL,  n17 NUMERIC,  b17 BLOB,  v17 TEXT,
  i18 INT,  r18 REAL,  n18 NUMERIC,  b18 BLOB,  v18 TEXT,
  i19 INT,  r19 REAL,  n19 NUMERIC,  b19 BLOB,  v19 TEXT,
  i20 INT,  r20 REAL,  n20 NUMERIC,  b20 BLOB,  v20 TEXT COLLATE NOCASE,
  i21 INT,  r21 REAL,  n21 NUMERIC,  b21 BLOB,  v21 TEXT COLLATE NOCASE,
  i22 INT,  r22 REAL,  n22 NUMERIC,  b22 BLOB,  v22 TEXT COLLATE NOCASE,
  i23 INT,  r23 REAL,  n23 NUMERIC,  b23 BLOB,  v23 TEXT COLLATE NOCASE,
  i24 INT,  r24 REAL,  n24 NUMERIC,  b24 BLOB,  v24 TEXT COLLATE NOCASE,
  i25 INT,  r25 REAL,  n25 NUMERIC,  b25 BLOB,  v25 TEXT COLLATE BINARY,
  i26 INT,  r26 REAL,  n26 NUMERIC,  b26 BLOB,  v26 TEXT COLLATE BINARY,
  i27 INT,  r27 REAL,  n27 NUMERIC,  b27 BLOB,  v27 TEXT COLLATE BINARY,
  i28 INT,  r28 REAL,  n28 NUMERIC,  b28 BLOB,  v28 TEXT COLLATE BINARY,
  i29 INT,  r29 REAL,  n29 NUMERIC,  b29 BLOB,  v29 TEXT COLLATE BINARY
) WITHOUT ROWID;

-- Populate the Test tables with Randomized data
WITH R(vi, vr, vt, vn, vb) AS (
    SELECT rnd.ri,
       rnd.rr,
       substr(p.rndText,1,rnd.rl),
       date('1901-01-01','+'||CAST((rnd.ri % 5000) AS TEXT)||' days'),
       randomblob(rnd.rl)
      FROM rnd,p
)
INSERT OR IGNORE INTO t1(id,
  i00, r00, n00, b00, v00,
  i01, r01, n01, b01, v01,
  i02, r02, n02, b02, v02,
  i03, r03, n03, b03, v03,
  i04, r04, n04, b04, v04,
  i05, r05, n05, b05, v05,
  i06, r06, n06, b06, v06,
  i07, r07, n07, b07, v07,
  i08, r08, n08, b08, v08,
  i09, r09, n09, b09, v09,
  i10, r10, n10, b10, v10,
  i11, r11, n11, b11, v11,
  i12, r12, n12, b12, v12,
  i13, r13, n13, b13, v13,
  i14, r14, n14, b14, v14,
  i15, r15, n15, b15, v15,
  i16, r16, n16, b16, v16,
  i17, r17, n17, b17, v17,
  i18, r18, n18, b18, v18,
  i19, r19, n19, b19, v19,
  i20, r20, n20, b20, v20,
  i21, r21, n21, b21, v21,
  i22, r22, n22, b22, v22,
  i23, r23, n23, b23, v23,
  i24, r24, n24, b24, v24,
  i25, r25, n25, b25, v25,
  i26, r26, n26, b26, v26,
  i27, r27, n27, b27, v27,
  i28, r28, n28, b28, v28,
  i29, r29, n29, b29, v29
) SELECT vi,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt
  FROM R,p
 WHERE ((vi / 100) % p.rndAddFreq) = 1
;

WITH R(vi, vr, vt, vn, vb) AS (
    SELECT rnd.ri,
       rnd.rr,
       substr(p.rndText,1,rnd.rl),
       date('1901-01-01','+'||CAST((rnd.ri % 5000) AS TEXT)||' days'),
       randomblob(rnd.rl)
      FROM rnd,p
)
INSERT OR IGNORE INTO t2(id,
  i00, r00, n00, b00, v00,
  i01, r01, n01, b01, v01,
  i02, r02, n02, b02, v02,
  i03, r03, n03, b03, v03,
  i04, r04, n04, b04, v04,
  i05, r05, n05, b05, v05,
  i06, r06, n06, b06, v06,
  i07, r07, n07, b07, v07,
  i08, r08, n08, b08, v08,
  i09, r09, n09, b09, v09,
  i10, r10, n10, b10, v10,
  i11, r11, n11, b11, v11,
  i12, r12, n12, b12, v12,
  i13, r13, n13, b13, v13,
  i14, r14, n14, b14, v14,
  i15, r15, n15, b15, v15,
  i16, r16, n16, b16, v16,
  i17, r17, n17, b17, v17,
  i18, r18, n18, b18, v18,
  i19, r19, n19, b19, v19,
  i20, r20, n20, b20, v20,
  i21, r21, n21, b21, v21,
  i22, r22, n22, b22, v22,
  i23, r23, n23, b23, v23,
  i24, r24, n24, b24, v24,
  i25, r25, n25, b25, v25,
  i26, r26, n26, b26, v26,
  i27, r27, n27, b27, v27,
  i28, r28, n28, b28, v28,
  i29, r29, n29, b29, v29
) SELECT vi,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt,
  vi, vr, vn, vb, vt
  FROM R,p
 WHERE ((vi / 10) % p.rndAddFreq) = 1
;

-- Simply make the chosen parameters visible
SELECT * FROM p;


-- Test JOIN method  -- This is the baddie.
SELECT t1.i00, t1.r00, t1.n00, t1.v00
  FROM t1
  JOIN t2 ON t2.i00 = t1.i00 AND t2.r00 = t1.r00 AND t2.n00 = t1.n00 AND t2.v00 = t1.v00
;

-- Test INTERSECT method
SELECT t1.i00, t1.r00, t1.n00, t1.v00
  FROM t1
INTERSECT
SELECT t2.i00, t2.r00, t2.n00, t2.v00
  FROM t2
;

-- Test WHERE IN (...) method
SELECT t1.i00, t1.r00, t1.n00, t1.v00
  FROM t1
WHERE t1.i00 IN (SELECT t2.i00 FROM t2)
;

-- Test JOIN with ORDER BY
SELECT t1.i00, t1.r00, t1.n00, t1.v00
  FROM t1
  JOIN t2 ON t2.i00 = t1.i00 AND t2.r00 = t1.r00 AND t2.n00 = t1.n00 AND t2.v00 = t1.v00
 ORDER BY t1.i00, t1.r00, t1.n00, t1.v00
;

-- Test INTERSECT with ORDER BY
SELECT t1.i00, t1.r00, t1.n00, t1.v00
  FROM t1
INTERSECT
SELECT t2.i00, t2.r00, t2.n00, t2.v00
  FROM t2
 ORDER BY t1.i00, t1.r00, t1.n00, t1.v00
;

-- Test WHERE IN (...) with ORDER BY
SELECT t1.i00, t1.r00, t1.n00, t1.v00
  FROM t1
WHERE t1.i00 IN (SELECT t2.i00 FROM t2)
 ORDER BY t1.i00, t1.r00, t1.n00, t1.v00
;

-- Display simply the COUNTs from our random and test tables and the matching
-- values, to ensure the test is valid.
WITH RowCounts(Entity, RowCnt) AS (
    SELECT 'Randomizer', COUNT(*) FROM rnd
    UNION ALL
    SELECT 'Test Table t1', COUNT(*) FROM t1
    UNION ALL
    SELECT 'Test Table t2', COUNT(*) FROM t2
    UNION ALL
    SELECT 't1 : t2 Matches', COUNT(*) FROM t2 WHERE t2.i00 IN (SELECT t1.i00 FROM t1)
)
SELECT * FROM RowCounts
;

-- CleanUp
DROP TABLE rnd;
DROP TABLE t1;
DROP TABLE t2;
DROP TABLE p;



_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

Reply via email to