In our application, we have an FTS4 table 'tmexpandedtext' that contains
all our text strings. Other tables have columns like fullname_id, path_id
and filename_id that point to strings in this table by their docid. The
issue we're seeing is that joining on the docid column seems really slow
compared to doing the join on an otherwise equivalent non-FTS4 table.

My question is twofold:

Am I wrong to think that joining on docid should be as fast as joining on
indexed integer columns in other tables? And if I'm indeed wrong about
that, what's the recommended way to speed things up?

Here's a transcript to show what's going on:

sqlite> .timer on
sqlite> .output foo.txt
sqlite> SELECT msg.tsc, msg.flags FROM tmmessages msg LEFT JOIN
tmexpandedtext etext ON etext.docid=msg.fullname_id LEFT JOIN
tmexpandedtext ptext
ON ptext.docid=msg.path_id LEFT JOIN tmexpandedtext ftext ON
ftext.docid=msg.filename_id;
CPU Time: user 31.106599 sys 0.062400
sqlite> .output stdout
sqlite> explain query plan SELECT msg.tsc, msg.flags FROM tmmessages msg
LEFT JOIN tmexpandedtext etext ON etext.docid=msg.fullname_id LEFT JOIN
tmexpandedtext ptext ON ptext.docid=msg.path_id LEFT JOIN tmexpandedtext
ftext ON ftext.docid=msg.filename_id;
0|0|0|SCAN TABLE tmmessages AS msg (~1000000 rows)
0|1|1|SCAN TABLE tmexpandedtext AS etext VIRTUAL TABLE INDEX 1: (~0 rows)
0|2|2|SCAN TABLE tmexpandedtext AS ptext VIRTUAL TABLE INDEX 1: (~0 rows)
0|3|3|SCAN TABLE tmexpandedtext AS ftext VIRTUAL TABLE INDEX 1: (~0 rows)
CPU Time: user 0.000000 sys 0.000000

The first scan is okay since in this example we're grabbing everything in
the tmmessages and resolving the three id columns to the corresponding
strings in tmexpandedtext. Normally I would say the three other scans are
suspicious, but from my reading I've been let to believe that a SCAN TABLE
with VIRTUAL TABLE INDEX is the moral equivalent of a SEARCH TABLE on a
non-FTS4 table. However, the slowness of the query seems to say otherwise.

For reference, here are the sizes of the two relevant tables:

sqlite> select count() from tmexpandedtext;
70154
CPU Time: user 0.031200 sys 0.000000
sqlite> select count() from tmmessages;
1100006
CPU Time: user 0.000000 sys 0.031200

Now let's create a non-FTS4 equivalent of the tmmessages table and re-do
the query with it:

sqlite> create table strings (content VARCHAR, docid INTEGER);
CPU Time: user 0.000000 sys 0.000000
sqlite> create index strings_docid ON strings(docid);
CPU Time: user 0.000000 sys 0.000000
sqlite> INSERT INTO strings SELECT content, docid FROM tmexpandedtext;
CPU Time: user 0.187201 sys 0.000000
sqlite> SELECT msg.tsc, msg.flags FROM tmmessages msg LEFT JOIN strings
etext ON etext.docid=msg.fullname_id LEFT JOIN strings ptext ON
ptext.docid=msg.path_id LEFT JOIN strings ftext ON
ftext.docid=msg.filename_id;
CPU Time: user 3.572423 sys 0.031200

As you can see, it is much faster and more in line with what you'd expect.
Here is the query plan:

sqlite> explain query plan SELECT msg.tsc, msg.flags FROM tmmessages msg
LEFT JOIN strings etext ON etext.docid=msg.fullname_id LEFT JOIN strings
ptext ON ptext.docid=msg.path_id LEFT JOIN strings ftext ON
ftext.docid=msg.filename_id;
0|0|0|SCAN TABLE tmmessages AS msg (~1000000 rows)
0|1|1|SEARCH TABLE strings AS etext USING COVERING INDEX strings_docid
(docid=?) (~10 rows)
0|2|2|SEARCH TABLE strings AS ptext USING COVERING INDEX strings_docid
(docid=?) (~10 rows)
0|3|3|SEARCH TABLE strings AS ftext USING COVERING INDEX strings_docid
(docid=?) (~10 rows)
CPU Time: user 0.000000 sys 0.000000

Any advice or insight would be greatly appreciated!

Sincerely,
Per Vognsen
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Reply via email to