Re: [sqlite] speeding up FTS4

Puneet Kishor Wed, 28 Sep 2011 12:54:38 -0700

On Sep 28, 2011, at 2:41 PM, Black, Michael (IS) wrote:

> What happens if you create an index on uris(feed_history_id)
> 
> 
>


Yeah, I noticed that lacking as well.


sqlite> EXPLAIN QUERY PLAN SELECT u.uri_id  
   ...> FROM projects p 
   ...>   JOIN feeds f ON f.project_id = p.project_id 
   ...>   JOIN feed_history fh ON f.feed_id = fh.feed_id 
   ...>   JOIN uris u ON fh.feed_history_id = u.feed_history_id 
   ...> WHERE p.project_id = 3;
0|0|0|SEARCH TABLE projects AS p USING INTEGER PRIMARY KEY (rowid=?) (~1 rows)
0|1|3|SCAN TABLE uris AS u (~46608 rows)
0|2|2|SEARCH TABLE feed_history AS fh USING INTEGER PRIMARY KEY (rowid=?) (~1 
rows)
0|3|1|SEARCH TABLE feeds AS f USING INTEGER PRIMARY KEY (rowid=?) (~1 rows)
CPU Time: user 0.000079 sys 0.000014



sqlite> CREATE INDEX uris_feed_history_id ON uris (feed_history_id);
CPU Time: user 12.766977 sys 82.766372


sqlite> EXPLAIN QUERY PLAN SELECT u.uri_id  
   ...> FROM projects p 
   ...>   JOIN feeds f ON f.project_id = p.project_id 
   ...>   JOIN feed_history fh ON f.feed_id = fh.feed_id 
   ...>   JOIN uris u ON fh.feed_history_id = u.feed_history_id 
   ...> WHERE p.project_id = 3;
0|0|0|SEARCH TABLE projects AS p USING INTEGER PRIMARY KEY (rowid=?) (~1 rows)
0|1|1|SCAN TABLE feeds AS f (~7 rows)
0|2|2|SEARCH TABLE feed_history AS fh USING AUTOMATIC COVERING INDEX 
(feed_id=?) (~5 rows)
0|3|3|SEARCH TABLE uris AS u USING COVERING INDEX uris_feed_history_id 
(feed_history_id=?) (~10 rows)
CPU Time: user 0.000123 sys 0.000015


But, no joy.


SELECT u.uri_id uri_id, u.uri uri, u.u_downloaded_on 
FROM fts_uri f 
    JOIN uris u ON f.uri_id = u.uri_id 
    JOIN feed_history fh ON u.feed_history_id = fh.feed_history_id 
    JOIN feeds f ON fh.feed_id = f.feed_id 
    JOIN projects p ON f.project_id = p.project_id 
WHERE p.project_id = 3
    AND u.u_downloaded_on >= p.u_project_start 
    AND fts_uri MATCH 'education school'
ORDER BY u.uri_id, u_downloaded_on DESC;
..
CPU Time: user 28.599581 sys 108.518205




> 
> 
> 
> ________________________________
> From: sqlite-users-boun...@sqlite.org [sqlite-users-boun...@sqlite.org] on 
> behalf of Puneet Kishor [punk.k...@gmail.com]
> Sent: Wednesday, September 28, 2011 2:00 PM
> To: General Discussion of SQLite Database
> Subject: EXT :Re: [sqlite] speeding up FTS4
> 
> 
> On Sep 28, 2011, at 1:14 PM, Black, Michael (IS) wrote:
> 
>> strftime returns a text representation.  So you didn't really change 
>> anything.
>> 
> 
> 
> That's not true at all. I added u_downloaded_on (u_ for unixtime)
> 
> CREATE TABLE uris (
>    uri_id INTEGER PRIMARY KEY,
>    uri TEXT,
>    uri_html TEXT,
>    uri_content TEXT,
>    downloaded_on DATETIME DEFAULT CURRENT_TIMESTAMP,
>    feed_history_id INTEGER,
>    u_downloaded_on INTEGER
> );
> 
> sqlite> SELECT u_downloaded_on FROM uris LIMIT 5;
> 1306450769
> 1306450769
> 1306450770
> 1306450774
> 1306450776
> 
> 
>> You need to use juliandays() as I said.
>> 
>> 
>> 
>> And you want a REAL number...not integer...though SQLite doesn't really care 
>> what you call it.  It's more for your own reference.
>> 
>> 
>> 
>> You just added a bunch more strings increasing the size of your 
>> database...ergo it ran slower.
>> 
>> 
> 
> 
> If I understand correctly, the *size* of the database should not matter. Or, 
> at least not matter as much. Imagine a database with one table with only one 
> row but with so much content in that row that it is 27 GB vs. a database with 
> a few million rows, each with small amount of content but together totaling 
> 27 GB.
> 
> What should matter is using the indexes correctly. In this case, my query 
> plan shows that I am hitting all the indexes. But, I think I am getting 
> closer to understanding this.
> 
> I started backward with
> 
> sqlite> SELECT p.project_id
>   ...> FROM projects p
>   ...> WHERE p.project_id = 3;
> 3
> CPU Time: user 0.000080 sys 0.000089
> 
> sqlite> SELECT f.feed_id, p.project_id
>   ...> FROM projects p JOIN feeds f ON f.project_id = p.project_id
>   ...> WHERE p.project_id = 3;
> ..
> CPU Time: user 0.000239 sys 0.000170
> 
> SELECT fh.feed_history_id, f.feed_id, p.project_id
>   ...> FROM projects p
>   ...>   JOIN feeds f ON f.project_id = p.project_id
>   ...>   JOIN feed_history fh ON f.feed_id = fh.feed_id
>   ...> WHERE p.project_id = 3;
> ..
> CPU Time: user 0.008491 sys 0.008054
> 
> SELECT u.uri_id, fh.feed_history_id, f.feed_id, p.project_id
>   ...> FROM projects p
>   ...>   JOIN feeds f ON f.project_id = p.project_id
>   ...>   JOIN feed_history fh ON f.feed_id = fh.feed_id
>   ...>   JOIN uris u ON fh.feed_history_id = u.feed_history_id
>   ...> WHERE p.project_id = 3;
> 
> 
> BOOM! Adding that last table makes my query way too slow.
> 
> Taking out the uris table and querying only the fts table
> 
> sqlite> SELECT uri_id
>   ...> FROM fts_uri
>   ...> WHERE fts_uri MATCH 'education school';
> 
> 
> starts producing the results immediately, but gives me back way too many 
> matches. I am now getting all the matches, but I want only the matches for 
> project id = 3 or matches throttled by u_downloaded_on (both of which are 
> achievable only via a JOIN with uris table).
> 
> 
>> 
>> Michael D. Black
>> 
>> Senior Scientist
>> 
>> NG Information Systems
>> 
>> Advanced Analytics Directorate
>> 
>> 
>> 
>> ________________________________
>> From: sqlite-users-boun...@sqlite.org [sqlite-users-boun...@sqlite.org] on 
>> behalf of Puneet Kishor [punk.k...@gmail.com]
>> Sent: Wednesday, September 28, 2011 12:44 PM
>> To: General Discussion of SQLite Database
>> Subject: EXT :Re: [sqlite] speeding up FTS4
>> 
>> 
>> On Sep 28, 2011, at 11:00 AM, Black, Michael (IS) wrote:
>> 
>>> Your change to numeric date/time may not take a long as you think.
>>> 
>>> 
>>> 
>> 
>> 
>> Took an hour and a half.
>> 
>> Step 1: Alter all tables with datetime columns, converting those columns to 
>> integer;
>> 
>> Step 2: Update all tables setting new datetime columns to unixtime
>> 
>> UPDATE table SET new_column = strftime('%s', old_column);
>> 
>> Step 3: CREATE INDEX uris_downloaded_on ON uris (u_downloaded_on);
>> 
>> Step 4: Run the following query
>> 
>>       SELECT u.uri_id uri_id, u.uri uri, u.u_downloaded_on
>>       FROM fts_uri f
>>               JOIN uris u ON f.uri_id = u.uri_id
>>               JOIN feed_history fh ON u.feed_history_id = fh.feed_history_id
>>               JOIN feeds f ON fh.feed_id = f.feed_id
>>               JOIN projects p ON f.project_id = p.project_id
>>       WHERE p.project_id = 3
>>               AND u.u_downloaded_on >= p.u_project_start
>>               AND fts_uri MATCH 'education school'
>>       ORDER BY u.uri_id, u_downloaded_on DESC;
>> 
>> Terrible time.
>>>> CPU Time: user 27.584849 sys 115.219293
>> 
>> Step 5: EXPLAIN QUERY PLAN (above SELECT query)
>> 
>>       0|0|0|SCAN TABLE fts_uri AS f VIRTUAL TABLE INDEX 4: (~0 rows)
>>       0|1|1|SEARCH TABLE uris AS u USING INTEGER PRIMARY KEY (rowid=?) (~1 
>> rows)
>>       0|2|2|SEARCH TABLE feed_history AS fh USING INTEGER PRIMARY KEY 
>> (rowid=?) (~1 rows)
>>       0|3|3|SEARCH TABLE feeds AS f USING INTEGER PRIMARY KEY (rowid=?) (~1 
>> rows)
>>       0|4|4|SEARCH TABLE projects AS p USING INTEGER PRIMARY KEY (rowid=?) 
>> (~1 rows)
>>       0|0|0|USE TEMP B-TREE FOR ORDER BY
>>       CPU Time: user 0.000099 sys 0.000008
>> 
>> Step 6: ANALYZE;
>> 
>> Step 7: Run the above SELECT query again
>> 
>> Terrible time.
>>>> CPU Time: user 27.703538 sys 116.684390
>> 
>> 
>> 
>> This train is going nowhere. The times are actually worse than they were 
>> when I was using a non-text column for date time.
>> 
>> By the way, have tried this on two machines -- the previous one was a 
>> dual-Xeon Xserve with 12 GB RAM. The current machine of choice is the top of 
>> the line iMac (quad core 3.4 GHz Intel Core i7) with 12 GB RAM and a 7200 
>> RPM SATA drive.
>> 
>> ..
> 

_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Re: [sqlite] speeding up FTS4

Reply via email to