Hi List,
I'm trying to find all hashes that are unique to a specific id (my_id),
and then use a UPDATE-Join to update another table with that number.
After much tweaking, I've simplified the table down to a basic temp
table (actually created using a CREATE AS SELECT ... GROUP BY my_id, hash):
CREATE TEMP TABLE t_distinct_hashes_by_id (
my_id INTEGER,
hash TEXT
);
And indexes in both directions because I'm still trying to optimise (and
see what SQLite wants):
CREATE UNIQUE INDEX temp.idx__1 ON t_distinct_hashes_by_id (
hash,
my_id
);
CREATE UNIQUE INDEX temp.idx__2 ON t_distinct_hashes_by_id (
my_id,
hash
);
There are only 20 values for my_id, but several hundred thousand hashes.
-----
I can do a SELECT query which gets what I want and runs in about 0.5
seconds:
SELECT
*
FROM
temp.t_distinct_hashes_by_id d
WHERE
hash NOT IN (SELECT hash FROM temp.t_distinct_hashes_by_id
sub where 1 != sub.my_id and hash not NULL)
AND
1 = d.my_id
The EXPLAIN:
0 0 0 SEARCH TABLE t_distinct_hashes_by_id AS l USING COVERING
INDEX idx__2 (my_id=?)
0 0 0 EXECUTE LIST SUBQUERY 1
1 0 0 SCAN TABLE t_distinct_hashes_by_id AS sub
-----
So in theory, I'd hope that an UPDATE version using this select should
take around 20 * 0.5 sec = 10 seconds. But it's actually taking... well
I don't know how long, at least 10 minutes before I gave up waiting.
This is the UPDATE:
UPDATE
meta_table
SET
distinct_hashes = (
SELECT
COUNT(hash) AS num
FROM
temp.t_distinct_hashes_by_id d
WHERE
hash NOT IN (SELECT hash FROM temp.t_distinct_hashes_by_id
sub where meta_table.my_id != sub.my_id and hash not NULL)
AND
-- This one has to be at the bottom for some reason.
meta_table.my_id = d.my_id
)
;
The query plan for this UPDATE includes two CORRELATED Subqueries, which
the docs say are reassessed on every run - that seems like the problem.
I get that it'd need to do that 20 times (once per my_id), but the
slowdown seems considerably longer than that needs to warrant.
0 0 0 SCAN TABLE meta_table
0 0 0 EXECUTE CORRELATED SCALAR SUBQUERY 0
0 0 0 SEARCH TABLE t_distinct_hashes_by_id AS l USING COVERING
INDEX idx__2 (my_id=?)
0 0 0 EXECUTE CORRELATED LIST SUBQUERY 1
1 0 0 SCAN TABLE t_distinct_hashes_by_id AS sub
Does anyone have any thoughts on how I can speed this up (SQLite 3.15.0
and confirmed in 3.24.0 (which uses about 3 times the disk IO / sec for
the same query/data))?
Thanks,
Jonathan
_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users