I want to build an application that can search for specific documents based
on a lot of criteria from a user (e.g. matching keywords, text, etc...).
And then to combine these results using boolean logic ... For example,
keyword 'animal' AND rating > 3
Each document has an integer id.
My strategy so far is to gather the id for each subquery into a temporary
table. Then I can combine the tables using UNION or INTERSECT keywords.
However, when I try a sample that does this on 250,000 rows, I find the
performance is rather slow: 2 seconds. (I've built sqlite3 myself on MacOS)
sqlite3 --version
> 3.8.10.2 2015-05-20 18:17:19 2ef4f3a5b1d1d0c4338f8243d40a2452cc1f7fe4
> cat config.sql intersection.sql | sqlite3
> 250001
> Run Time: real 2.000 user 1.956734 sys 0.044040
>
(samples files included below.)
I wonder if anyone has suggestions for how to improve this approach. For
example, I've considered trying to build a monolithic query which does the
subqueries and UNION/INTERSECTION logic all at once, but my previous
experience has shown that the resulting queries are very complex and hard
for me to reason about if/when they get slow. This approach lets me
profile each subquery easily. It also lets me tackle sorting the results
as a separate step.
Another idea I've toyed with is building a custom implementation of UNION /
INTERSECTION for result sets that are just sets of integers. I could do
this as a virtual table in sqlite.
Thanks for your thoughts on this problem.
Andy
config.sql contains:
CREATE TABLE config( numAssets );
INSERT INTO config VALUES( 250000 );
intersection.sql contains:
CREATE TEMPORARY TABLE idset1 ( id PRIMARY KEY );
CREATE TEMPORARY TABLE idset2 ( id PRIMARY KEY );
CREATE INDEX idset1_id ON idset1 ( id );
CREATE INDEX idset2_id ON idset2 ( id );
BEGIN TRANSACTION;
INSERT INTO idset1
SELECT id FROM ( WITH RECURSIVE
cnt( id ) AS (
VALUES( 0 ) UNION ALL
SELECT id+1 FROM cnt WHERE id < ( SELECT MAX( numAssets ) FROM config
))
select * from cnt ), config;
COMMIT TRANSACTION;
BEGIN TRANSACTION;
INSERT INTO idset2
SELECT id + 1000000 FROM ( WITH RECURSIVE
cnt( id ) AS (
VALUES( 0 ) UNION ALL
SELECT id+1 FROM cnt WHERE id < ( SELECT MAX( numAssets ) FROM config
))
select * from cnt ), config;
COMMIT TRANSACTION;
.timer ON
SELECT count(id) FROM idset1 WHERE id IN ( SELECT id FROM idset1 UNION
SELECT id FROM idset2 );
.timer OFF