What happens if you create an index on uris(feed_history_id)




Michael D. Black

Senior Scientist

NG Information Systems

Advanced Analytics Directorate



________________________________
From: sqlite-users-boun...@sqlite.org [sqlite-users-boun...@sqlite.org] on 
behalf of Puneet Kishor [punk.k...@gmail.com]
Sent: Wednesday, September 28, 2011 2:00 PM
To: General Discussion of SQLite Database
Subject: EXT :Re: [sqlite] speeding up FTS4


On Sep 28, 2011, at 1:14 PM, Black, Michael (IS) wrote:

> strftime returns a text representation.  So you didn't really change anything.
>


That's not true at all. I added u_downloaded_on (u_ for unixtime)

CREATE TABLE uris (
    uri_id INTEGER PRIMARY KEY,
    uri TEXT,
    uri_html TEXT,
    uri_content TEXT,
    downloaded_on DATETIME DEFAULT CURRENT_TIMESTAMP,
    feed_history_id INTEGER,
    u_downloaded_on INTEGER
);

sqlite> SELECT u_downloaded_on FROM uris LIMIT 5;
1306450769
1306450769
1306450770
1306450774
1306450776


> You need to use juliandays() as I said.
>
>
>
> And you want a REAL number...not integer...though SQLite doesn't really care 
> what you call it.  It's more for your own reference.
>
>
>
> You just added a bunch more strings increasing the size of your 
> database...ergo it ran slower.
>
>


If I understand correctly, the *size* of the database should not matter. Or, at 
least not matter as much. Imagine a database with one table with only one row 
but with so much content in that row that it is 27 GB vs. a database with a few 
million rows, each with small amount of content but together totaling 27 GB.

What should matter is using the indexes correctly. In this case, my query plan 
shows that I am hitting all the indexes. But, I think I am getting closer to 
understanding this.

I started backward with

sqlite> SELECT p.project_id
   ...> FROM projects p
   ...> WHERE p.project_id = 3;
3
CPU Time: user 0.000080 sys 0.000089

sqlite> SELECT f.feed_id, p.project_id
   ...> FROM projects p JOIN feeds f ON f.project_id = p.project_id
   ...> WHERE p.project_id = 3;
..
CPU Time: user 0.000239 sys 0.000170

SELECT fh.feed_history_id, f.feed_id, p.project_id
   ...> FROM projects p
   ...>   JOIN feeds f ON f.project_id = p.project_id
   ...>   JOIN feed_history fh ON f.feed_id = fh.feed_id
   ...> WHERE p.project_id = 3;
..
CPU Time: user 0.008491 sys 0.008054

SELECT u.uri_id, fh.feed_history_id, f.feed_id, p.project_id
   ...> FROM projects p
   ...>   JOIN feeds f ON f.project_id = p.project_id
   ...>   JOIN feed_history fh ON f.feed_id = fh.feed_id
   ...>   JOIN uris u ON fh.feed_history_id = u.feed_history_id
   ...> WHERE p.project_id = 3;


BOOM! Adding that last table makes my query way too slow.

Taking out the uris table and querying only the fts table

sqlite> SELECT uri_id
   ...> FROM fts_uri
   ...> WHERE fts_uri MATCH 'education school';


starts producing the results immediately, but gives me back way too many 
matches. I am now getting all the matches, but I want only the matches for 
project id = 3 or matches throttled by u_downloaded_on (both of which are 
achievable only via a JOIN with uris table).


>
> Michael D. Black
>
> Senior Scientist
>
> NG Information Systems
>
> Advanced Analytics Directorate
>
>
>
> ________________________________
> From: sqlite-users-boun...@sqlite.org [sqlite-users-boun...@sqlite.org] on 
> behalf of Puneet Kishor [punk.k...@gmail.com]
> Sent: Wednesday, September 28, 2011 12:44 PM
> To: General Discussion of SQLite Database
> Subject: EXT :Re: [sqlite] speeding up FTS4
>
>
> On Sep 28, 2011, at 11:00 AM, Black, Michael (IS) wrote:
>
>> Your change to numeric date/time may not take a long as you think.
>>
>>
>>
>
>
> Took an hour and a half.
>
> Step 1: Alter all tables with datetime columns, converting those columns to 
> integer;
>
> Step 2: Update all tables setting new datetime columns to unixtime
>
> UPDATE table SET new_column = strftime('%s', old_column);
>
> Step 3: CREATE INDEX uris_downloaded_on ON uris (u_downloaded_on);
>
> Step 4: Run the following query
>
>        SELECT u.uri_id uri_id, u.uri uri, u.u_downloaded_on
>        FROM fts_uri f
>                JOIN uris u ON f.uri_id = u.uri_id
>                JOIN feed_history fh ON u.feed_history_id = fh.feed_history_id
>                JOIN feeds f ON fh.feed_id = f.feed_id
>                JOIN projects p ON f.project_id = p.project_id
>        WHERE p.project_id = 3
>                AND u.u_downloaded_on >= p.u_project_start
>                AND fts_uri MATCH 'education school'
>        ORDER BY u.uri_id, u_downloaded_on DESC;
>
> Terrible time.
>>> CPU Time: user 27.584849 sys 115.219293
>
> Step 5: EXPLAIN QUERY PLAN (above SELECT query)
>
>        0|0|0|SCAN TABLE fts_uri AS f VIRTUAL TABLE INDEX 4: (~0 rows)
>        0|1|1|SEARCH TABLE uris AS u USING INTEGER PRIMARY KEY (rowid=?) (~1 
> rows)
>        0|2|2|SEARCH TABLE feed_history AS fh USING INTEGER PRIMARY KEY 
> (rowid=?) (~1 rows)
>        0|3|3|SEARCH TABLE feeds AS f USING INTEGER PRIMARY KEY (rowid=?) (~1 
> rows)
>        0|4|4|SEARCH TABLE projects AS p USING INTEGER PRIMARY KEY (rowid=?) 
> (~1 rows)
>        0|0|0|USE TEMP B-TREE FOR ORDER BY
>        CPU Time: user 0.000099 sys 0.000008
>
> Step 6: ANALYZE;
>
> Step 7: Run the above SELECT query again
>
> Terrible time.
>>> CPU Time: user 27.703538 sys 116.684390
>
>
>
> This train is going nowhere. The times are actually worse than they were when 
> I was using a non-text column for date time.
>
> By the way, have tried this on two machines -- the previous one was a 
> dual-Xeon Xserve with 12 GB RAM. The current machine of choice is the top of 
> the line iMac (quad core 3.4 GHz Intel Core i7) with 12 GB RAM and a 7200 RPM 
> SATA drive.
>
> ..

_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Reply via email to